Abstract

This thesis examines the design of geographically centralized high performance packet switched 
networks called routing networks. Each of these networks is intended to be used to interconnect the
modules of a highly parallel computer system. The design of such networks is considered in present 
(1984) technology where only a small number of network nodes can be placed on a single chip and in 
VLSI technology where a large number of nodes can be placed on a chip. 

In both technologies, the design of routing networks for uniform patterns of communication is 
considered. In each technology, it is shown that the characteristics of these patterns imply a minimum 
cost for networks capable of supporting them. In present technology, the performance of a particular 
network that is well suited for uniform communication, the indirect n-cube routing network, is studied.
The strongest constraint on the performance of the indirect n-cube network that is found still allows the
throughput of the network for uniform patterns of communication to grow linearly with the size of
the network. In VLSI, the use of networks such as the crossbar and the indirect n-cube to support 
uniform patterns of communication is considered. 

The design of routing networks for a few localized patterns of communication is briefly considered 
in both technologies. In each technology, networks that are well suited for these localized communication
patterns are discussed.

1. Introduction

1.1 Overview 

Data communication considerations are becoming increasingly important in the design of high
performance computers. Conventional single sequence computer designs have been refined to the point 
where only limited further improvements in performance can be expected without a dramatic 
improvement in the speed of available circuits. Levels of performance significantly higher than those of 
present computers can only be achieved by machines capable of substantial concurrency. To achieve 
high concurrency a machine must support a large flow of data and control signals. As we shall see below, 
a class of digital systems which seems well suited for the implementation of such machines in light of their 
communication requirements is the class of packet communication systems [6]. A packet communication 
system is composed of a number of subsystems interconnected by one or more communication networks 
called routing networks. Data transfer between two subsystems is accomplished by passing a packet over 
a network path from one to the other. This thesis examines several aspects of routing network design.

1.2 Packet Communication Systems 

By definition, any digital system with the following properties is a packet communication system. A 
packet communication system is composed of modules and a set of links where each link connects one or 
more modules. A module can be any form of digital system, and may be capable of storing data and 
performing various operations on that data. The transmission of data from one module to another can 
only be accomplished by passing a packet along a path of connected intermediate modules. Thus the 
behavior of a module can depend only on its internal state and packets it receives from modules 
connected to it. 

Packet communication systems seem well suited for implementing concurrent computers that need 
to support a large flow of data. Consideration of data communication enters at a very early stage in the
design of a packet communication system. In particular, the design of the component modules and the
manner in which they are interconnected should reflect the inherent structure of the computations to be
performed by the system. By patterning the structure of the system to the structure of the problem the
designer can develop a system which supports the required flow of data with a minimum of hardware. 

In addition, data communication by packet passing as done in packet communication systems can 
facilitate the efficient use of data paths. This is particularly true for systems such as the Dennis data flow 
machine [7] that require only short messages to be transferred among their component units. For these 
machines packet communication has advantages over the alternative circuit switching. In circuit 
switching a complete path between two units must be set up before a message can be transferred from 
one to the other and the entire path must remain allocated for the duration of the transfer. In a packet
communication system a message is transferred in the form of a packet that is passed from module to
module along its desired path. Thus, the message transfer only requires one link at any given time. Only
the next link in a packet's desired path need be available for the packet to proceed.

1.3 Routing Networks 

A routing network is by definition a packet communication system with designated input and 
output links (as shown in Figure 1) that has the ability to accept tagged packets on each of its designated 
inputs and to route each packet to the output corresponding to its tag. A routing network may be 
connected by its input and output links to other packet communication systems and thus provide 
intercommunication among them. A routing network as a packet communication system is composed of
packet communication modules that are interconnected by links. For the purpose of discussion, the
internal modules of a network will be called nodes. 

Routing networks can be used to interconnect several small packet communication systems into a 
larger packet communication system. In such a system, the routing networks handle the required transfer
of data and control packets among the various subsystems. Thus, an important part of the design of large
packet communication systems is the design of their routing networks.

Fig. 1. Example Routing Network



Routing networks are packet networks; data is transferred in the form of packets that are passed
from node to node in the network. Since routing networks are also packet communication systems and
since the control of a packet communication system cannot be centralized, the routing of packets must be
done in a decentralized fashion. The routing decisions made by a node for a given packet can depend
only on the state information of the node and the label of the packet. There are, however, a number of
differences between these networks, which are intended for use in localized high performance systems, and
distributed packet networks such as the ARPA network. In contrast to the ARPA network where the cost
of data paths dominates, both the cost of data paths and the cost of nodes must be considered in the
design of a routing network. Thus, in the design of a routing network it is important to minimize the
complexity of network nodes. Very large buffers for queuing packets or very large tables for storing
routing information cannot be used. Unlike the ARPA network, a routing network has data paths and
nodes of comparable speed and reliability. This suggests that a simple routing algorithm should be used
by each node in a routing network in order to minimize the total time required for a packet to pass
through the network.

Routing networks differ from the majority of networks that have been studied in classical switching 
theory. While routing networks use packet switching, switching theory has been primarily concerned 
with networks that use circuit switching. Further, in contrast to routing networks that use decentralized 
control, most of the networks that have been studied in switching theory use centralized control. Finally, 
switching theory has assumed that the cost of wires is negligible in comparison to the cost of active switch
elements. This assumption, as we see below, is not valid for some of the technologies that may be used to
implement routing networks. Thus, while classical switching theory provides a good starting point for
research on routing networks, most of the results that have been obtained do not directly apply to routing
networks. In this thesis, we examine cost and performance issues for routing networks that are similar to
the cost and performance issues that have been studied for circuit switched networks in classical switching
theory. We will use cost measures that are appropriate for present and future integrated circuit 
technologies, and performance measures that are appropriate for the intended network applications. 

1.4 Research Topics 

This thesis examines the design of routing networks under two different sets of assumptions that 
correspond to two points on the apparent path of integrated circuit technology evolution. One set of 
assumptions corresponds to present technology where only a small number of network nodes can be 
implemented on a single integrated circuit. The other set of assumptions corresponds to a technology 
where a large number of network nodes can be implemented on a single integrated circuit. 

In 1984 technology, consideration of the length of wire needed to implement a given link seems 
unimportant. It seems unlikely that more than a small number of modules can be implemented on a
single integrated circuit. Links between modules can be implemented for the most part as printed circuit
board wires. The length (as opposed to the number) of such wires is not a significant factor in the total
cost of the system. Similarly, the effect of wire length on system speed is small since even the propagation
time of a long wire is less than the delay of the circuit required to drive it from an on-chip signal.

In Chapter 2, we examine the design of routing networks for present technology. We assume that
the length of wire required to implement each network link does not affect either the cost or the
performance of the network. Further, we place certain additional restrictions on the behavior and
complexity of network nodes in order to narrow down the number of design parameters that must be
considered. These restrictions, which are described in Section 2.2, are motivated by the current state of
technology and the nature of the systems in which the networks may be used.

Given these restrictions, we examine in Section 2.3 the design of networks for a class of systems 
characterized by uniform communication; each source of packets in such a system generates packets for 
all the possible destinations and over the long run generates a comparable number for each destination. 
For the purpose of analysis, we introduce simple probabilistic models of the packet sources and sinks of a
system with uniform communication. We examine the minimum number of nodes required by any
network that is capable of high throughput when it is connected to the model sources and sinks. We
study a particular network, the indirect n-cube routing network, that seems well suited for uniform
communication and has a number of nodes within a constant factor of the lower bound. Below, we use
the term indirect n-cube network to refer to the indirect n-cube routing network.

It should be noted that networks related to routing networks in general and the indirect n-cube 
network in particular have been studied in the literature. Sorting networks, networks capable of sorting 
N data items in parallel where N is the number of network inputs, are clearly not the same as routing
networks, but intuition would suggest that these two types of networks have similar complexities. Sorting 
networks using O(N log^2 N) nodes have been known for some time [4]. More recently, O(N log N) node
sorting networks have been described [3], although these networks may not be of practical interest 
because of the very large constant factor. The indirect n-cube network has a comparable complexity; it
uses O(N log N) nodes. Earlier work on routing networks with structures similar to that of the indirect
n-cube network has been done [5]. Other networks with related structures have been described in the
literature. Some of these networks have been proposed to perform permutations on large vectors of data.
In general, these permutation networks have been proposed for systems in which the elements of a given
vector are processed synchronously and then permuted. These permutation networks include
shuffle-exchange, omega, Pease's indirect n-cube permutation network, and cube-connected cycles [24,
14, 15, 16, 21, 23]. Other networks with related structures have been proposed to interconnect the
processors and memories of other types of multiple processor systems. Circuit switched and packet
switched banyan networks have been proposed for single instruction stream multiple processor systems
[10, 26]. Circuit switched and packet switched delta networks have been proposed for multiprocessors in
which each processor makes independent and random memory accesses [20, 8]. The relationship between
the indirect n-cube routing network and previously studied networks is discussed in more detail in
Section 2.2.3.1.

We consider in Section 2.3 the operation of large indirect n-cube routing networks when connected
to the model sources and sinks and examine certain important characteristics of the operation of these 
networks. We examine the influence of these characteristics by using network models that accurately 
model these characteristics and that are considerably simpler than the actual network. By analyzing and 
simulating the models, we examine the influence of these characteristics on certain aspects of the 
performance of the networks. The performance predicted by these models is compared to the 
performance of the actual network which we measure by simulation. 

One important aspect of performance that we study is throughput. We examine the relationship 
between network throughput and network size. We would like the throughput of the network to scale
linearly with the number of network inputs since if we form a composite packet communication system
by using a routing network to interconnect several subsystems, we would like the performance of the
system to scale linearly with the number of subsystems.

Another important aspect of performance that we study is the speed of slow inputs. If a particular
network input becomes extremely slow due to network congestion then packets from the module
connected to that input can be delayed. If the congestion continues for a long period, a large number of
modules that are either directly or indirectly dependent on the blocked module in a highly parallel
computation can be affected. 

Our study suggests that very large indirect n-cube networks can support high performance for
uniform communication patterns. The strongest constraint on network throughput that we find in our 
study still allows throughput to grow linearly with network size. However, our study also indicates that 
some of the inputs of a very large network can be slow for a very long period of time. 

We also briefly examine in Section 2.3 the design of routing networks for a class of systems
characterized by localized communication; the majority of packets generated by a particular source in
such a system are tagged for a small group of destinations. Many localized communication patterns can
be supported with networks that are less complex than the indirect n-cube network. We discuss one
obvious family of networks that seem appropriate for some important localized communication patterns. 
One of the characteristics of networks of this family is a number of nodes equal to the sum of the number 
of network inputs and the number of network outputs. This family includes grid structured networks and 
tree structured networks. 

In Chapter 3, we examine the design of routing networks in the technology of five to ten years from
now. 

As technology changes and the number of network nodes that can be placed on a given integrated 
circuit increases, the importance of wire length in network cost will increase. Within a few years it should
be possible to implement many modules and the links that interconnect them on a single chip. The
length of wire required to implement each link will then be a significant factor in the cost of silicon 
required to implement a system, since the chip acreage used by each wire is proportional to its length. 
However, it is likely that for the immediate future, the next five to ten years, even long wires can be 
driven quickly if drivers of the appropriate size are used. 

In Chapter 3, we make some assumptions about the characteristics of the integrated circuit 
technology of the next five to ten years and study the design of routing networks under these
assumptions. For the purpose of discussion, we refer to the technology that will exist at the end of this 
period as very large scale integration (VLSI). We describe a model of VLSI based on some assumptions 
about the characteristics of VLSI. We examine in the VLSI model the fundamental cost of a single chip 
network to support a certain level of performance for uniform patterns of communication. We examine a
few structures that seem appropriate for implementing a single chip uniform communication network in 
VLSI. These structures include a crossbar structure, and an indirect n-cube structure. We discuss a 
technique for interconnecting such single chip networks to form larger uniform communication networks. 
We also briefly examine the design in VLSI of networks for localized patterns of communication. We 
examine a few example network structures and describe the communication patterns that they can
support. 

1.5 Notation for Asymptotic Bounds 

We use the following notation to describe asymptotic bounds. 

We say that f(N) is Ω(g(N)) if and only if there exist N_0 and c greater than zero such that f(N) is
greater than or equal to c g(N) for all N greater than N_0.



We say that f(N) is Θ(g(N)) if and only if there exist N_0, c_1, and c_2 greater than zero such that
c_1 g(N) ≤ f(N) ≤ c_2 g(N) for all N greater than N_0.

We say that f(N) is O(g(N)) if and only if there exist N_0 and c greater than zero such that f(N) is
less than or equal to c g(N) for all N greater than N_0.

2. Design of Routing Networks Ignoring Wire Length

2.1 Network Restrictions 

In general, we will be concerned only with routing networks that obey the restrictions described in 
the following paragraphs. 

We assume that time is divided into units and that any operation in the network starts on a
transition between two time units and finishes before the next transition. It is important to note in
passing that this assumption of purely synchronous behavior is only for the convenience of analysis and
that the networks that we shall present can easily be designed to function asynchronously. 

We assume that there is some limit k on the total number of input and output links that can be 
attached to a particular network node, and we assume that there is a limit o on the total number of 
packets that may be buffered at any particular network node at any particular time. These restrictions are 
motivated by our desire to bound the amount of chip space and number of external connections required 
to implement a network node as a portion of an integrated circuit. 

For the most part, we assume that only two nodes are connected to a given link. The only exception
is our discussion of the minimum cost of a network to support high throughput for uniform patterns of
communication. In that discussion, we get a more general lower bound by assuming that an arbitrary
number of nodes can be connected to a given link. In all cases, we assume that one link can transfer at
most one packet per unit time. 

In the case that only two nodes are connected by a link, we assume that the two nodes observe a
ready/acknowledge protocol for transferring packets. In particular, each link contains acknowledge and
ready control lines in addition to the data lines used to transfer the packets as shown in Figure 2. The
protocol has 4 phases as shown in Figure 3. A sending node can place a new packet on the data lines of
the link only if all 4 phases of the transmission of the previous packet have been completed. The sending
node asserts the ready line after placing the packet on the data lines and will continue to assert the ready
and data lines until the acknowledge line is asserted by the receiving node. The receiving node can accept a
packet available on the data lines only after the ready line has been asserted, and the receiving node will
assert the acknowledge line only after it has safely stored the packet in one of its buffers. This protocol was
chosen since it is widely known and provides in a straightforward manner the necessary coordination for
packet passing.

Fig. 2. Lines of a Link
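
The handshake described above can be sketched in a few lines of Python. This is a minimal sketch rather than the design used in the thesis: the names Link, Receiver, sender_step, receiver_step, and transfer_all are hypothetical, and the two return-to-idle phases (releasing ready, then releasing acknowledge) are assumed from the usual four-phase convention, since the text spells out only the first two phases.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Link:
    """Hypothetical model of one link: data lines plus two control lines."""
    data: Optional[Any] = None
    ready: bool = False
    acknowledge: bool = False


@dataclass
class Receiver:
    """Receiving node with a bounded packet buffer (never drained in this sketch)."""
    capacity: int
    buffer: List[Any] = field(default_factory=list)


def sender_step(link: Link, pending: List[Any]) -> None:
    if not link.ready and not link.acknowledge and pending:
        # Phase 1: the previous transfer is complete, so place a new packet
        # on the data lines and assert ready.
        link.data = pending.pop(0)
        link.ready = True
    elif link.ready and link.acknowledge:
        # Phase 3 (assumed): the receiver has the packet; release ready and data.
        link.ready = False
        link.data = None


def receiver_step(link: Link, rx: Receiver) -> None:
    if link.ready and not link.acknowledge and len(rx.buffer) < rx.capacity:
        # Phase 2: store the packet safely, then assert acknowledge.
        rx.buffer.append(link.data)
        link.acknowledge = True
    elif link.acknowledge and not link.ready:
        # Phase 4 (assumed): the sender has released ready; release acknowledge.
        link.acknowledge = False


def transfer_all(packets: List[Any], capacity: int) -> List[Any]:
    link, rx, pending = Link(), Receiver(capacity), list(packets)
    # One loop iteration runs a sender step and a receiver step, so two
    # iterations cover the four phases of one packet.
    for _ in range(2 * len(packets)):
        sender_step(link, pending)
        receiver_step(link, rx)
    return rx.buffer
```

When the receiving buffer is full, the receiver never asserts acknowledge, so the sender keeps ready and the data lines asserted and cannot place another packet on the link. In this sketch that is the only form of backpressure.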

We assume that the behavior of a node during a given time unit can depend only on the 
information available on its links and the contents of packets stored in its buffers, and we assume that the 
node's behavior can depend only on the destination label and not the data portion of any packet stored in 
the node. These restrictions are in part motivated by our desire to implement each network node as a
portion of an integrated circuit. By limiting the complexity of each node's behavior we limit the amount
of space required to implement the control circuits of each node. These restrictions are also motivated by
our desire to minimize the time required for a packet to traverse the network since by limiting the
complexity of the control algorithm of each node we indirectly limit the time required by each node to
process a packet. These restrictions seem natural and can be easily observed in the design of networks.
However, it should be noted that there are many alternative sets of restrictions that could be placed on the
node's behavior, and that a different set of restrictions might well lead to different network structures.

Finally, we assume that a packet that enters a node in a given unit of time can not leave that node 
until the next unit of time, and that a link can transfer at most one packet per unit time. Although we are 
ignoring the time required to propagate a packet over a link, we are not ignoring the time required to gate 
a packet through a node or store it in a buffer. 

2.2 Networks for Systems with Uniform Communication 

2.2.1 Model of the Problem 

In general, a routing network will be used as shown in Figure 4 to connect a group of packet sources
to a group of packet receivers. Each source will produce labeled packets and each packet must be
delivered to the receiver which corresponds to its label.

Fig. 4. Use of a Routing Network

We describe here a simple model of a system with uniform communication. For our model, we
assume that the number of receivers equals the number of sources and that receivers and sources behave
in a very simple manner. We assume that a receiver will process a message as soon as it arrives and can
thus accept a packet every unit of time. We assume that if the network has accepted the last packet
generated by a particular source, then the chance that the source will produce a new packet in a unit of
time is P where P is a parameter of the model. We assume that the label of each packet is independently
selected, and that all of the possible receiver labels are equally likely.
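
The model source can be sketched as follows. This is only an illustration under stated assumptions: the class name ModelSource and its methods are hypothetical, labels are taken to be the integers 0 to N-1, and a packet produced in a time unit is allowed to be handed to the network in that same unit, a timing detail the model above does not fix.

```python
import random
from typing import Optional


class ModelSource:
    """Source that produces a packet with probability p per time unit,
    but only once the network has accepted its previous packet."""

    def __init__(self, p: float, num_receivers: int, rng: random.Random) -> None:
        self.p = p
        self.num_receivers = num_receivers
        self.rng = rng
        # Destination label of the packet waiting to enter the network, if any.
        self.pending: Optional[int] = None

    def step(self, network_accepts: bool) -> Optional[int]:
        """Run one time unit; return the label handed to the network, or None."""
        if self.pending is None and self.rng.random() < self.p:
            # Labels are independent and uniform over all receivers.
            self.pending = self.rng.randrange(self.num_receivers)
        if self.pending is not None and network_accepts:
            label, self.pending = self.pending, None
            return label
        return None
```

A model receiver needs no state in this sketch, since it accepts a packet in every time unit. With a network that always accepts, the source offers packets at an average rate of P per time unit; when its input is blocked, the source holds the same packet and produces nothing new.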

2.2.2 Minimum Network Cost 

In this section, we find a lower bound on the complexity of any N-input N-output routing network
capable of Ω(N) average throughput when each of its inputs is connected to a model source and each of
its outputs is connected to a model receiver. We measure throughput as packets per unit time and
complexity as the number of nodes in the network. We show that such a network requires Ω(N log N)
nodes. This result gives some motivation for the fact that the network that we study for such applications,
the indirect n-cube network, and most related networks require Ω(log N) stages of Ω(N) nodes.

For the purpose of this discussion, we allow an arbitrary number of nodes to be connected to a 
given link. As before, we allow only one packet to be transferred over a given link in a given unit of time.
Clearly, the lower bound on network cost that we obtain by allowing an arbitrary number of nodes to be 
connected to a given link also holds if only two nodes are allowed to be connected to a given link. 



Proposition. Ω(N log N) nodes are required by any N-input N-output routing network capable of
Ω(N) average throughput when each of its inputs is connected to a model source and each of its outputs is
connected to a model receiver. 

Proof. Since we assume that a network node can be connected to at most k links, we can get a lower
bound on the number of network nodes by finding a lower bound on the sum, over all links in the
network, of the number of nodes connected to each link. For the purpose of this discussion, we use the
term connection to refer to the juncture between a network link and a network node. Thus, the sum, over
all links in the network, of the number of nodes connected to each link is equal to the number of
connections in the network. We get a lower bound on the number of connections in the network by
considering the sum, over all the packets processed during a long period, of the number of connections
used by each packet and by making use of the fact that a connection can be involved in at most one
operation per unit time.

A lower bound on the number of connections in the network can be obtained by considering the 
operation of the network over a long period of time and examining the use of network connections during 
such a period. Let us consider a period of T time units for some very large T. Since we assume that only 
one packet can be transferred on a link in one unit of time, it follows that a connection can be used for 
only one packet in a given unit of time. The total number of connections in the network must be at least 
as great as (1/T) times the number of connection operations during the period where the number of
connection operations is defined to be the sum, over all packets processed, of the number of connections
used by each packet. It should be noted that we count each connection of a link separately, and that the
number of connections used by a packet includes each connection of each link used by the packet.

A lower bound on the expected number of connection operations during the T time unit period can
be obtained by considering the possible paths through the network. For the purpose of this discussion,
we assume that each network link is logically composed of some number of link segments. In particular,
we assume that the nodes connected to a given link are connected according to some linear order, and
that a link segment is a portion of a link between two adjacent nodes as shown in Figure 5. A link which
is connected to l nodes has l - 1 segments. By this definition, a connection between a link and a node
can involve at most two link segments, and at most 2k link segments in total can be connected to a node.
Less than (2k - 1)^(l+1) outputs can be reached from a particular input by a path containing l or fewer
link segments. Thus, there are less than N/2 outputs that can be reached from a given input using a path
containing no more than (log_(2k-1) (N/2)) - 1 link segments. Since a model source randomly selects a
destination for each packet it generates and since all destinations are equally likely, there is at least a 50%
chance that a packet generated by a model source must travel over a path of greater than
(log_(2k-1) (N/2)) - 1 link segments. It follows that the expected number of link segments used by such a
packet must be Ω(log N). Since the number of nodes connected to a given link is equal to the number of
segments in the link plus one, the total number of nodes connected to the links in the path of a particular
packet must be larger than the number of link segments in the path. Thus, a packet generated by a model
source is expected to use at least Ω(log N) connections. Since by assumption the expected number of
packets processed in the T time unit period is Ω(TN), the expected number of connection operations
during the period must be Ω(TN log N).

Fig. 5. Conceptual Model of Nodes Connected to a Link

A lower bound on the number of nodes in the network follows. Since the expected number of
connection operations during the T time unit period is Ω(TN log N) and since a connection can be
involved in at most one operation in a given unit of time, the network must have Ω(N log N) separate
connections between links and nodes. Since each node can be connected to at most k links and since k is
fixed and is independent of N, the network must have Ω(N log N) nodes.

Thus, there must be Ω(N log N) nodes in an N-input N-output routing network capable of Ω(N)
average throughput when each of its inputs is connected to a model source and each of its outputs is 
connected to a model receiver. 

This ends the discussion of the proposition.
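
The chain of bounds used in the proof can be collected in one place. The block below is an informal summary rather than a replacement for the argument. The symbols C (number of connections), n (number of nodes), X (number of connection operations in the period), and l* (the path length threshold) are introduced here only for illustration, and the reachability bound in the first line is assumed to have the form (2k-1)^(l+1), with l the number of link segments in a path.

```latex
\begin{align*}
  \#\{\text{outputs reachable within } l \text{ segments}\} &< (2k-1)^{l+1}
    && \text{(assumed reachability bound)} \\
  l^{*} &= \log_{2k-1}(N/2) - 1
    && \text{(fewer than } N/2 \text{ outputs lie within } l^{*} \text{ segments)} \\
  \mathbb{E}[\text{segments per packet}] &\ge \tfrac{1}{2}\, l^{*} = \Omega(\log N)
    && \text{(destinations chosen uniformly)} \\
  \mathbb{E}[X] &= \Omega(TN) \cdot \Omega(\log N) = \Omega(TN \log N)
    && \text{(}\Omega(TN)\text{ packets in } T \text{ time units)} \\
  C &\ge \mathbb{E}[X] / T = \Omega(N \log N)
    && \text{(one operation per connection per time unit)} \\
  n &\ge C / k = \Omega(N \log N)
    && \text{(at most } k \text{ links per node)}
\end{align*}
```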

2.2.3 Routing Networks Using an Indirect n-Cube Topology 

2.2.3.1 Introduction 

The network shown in Figure 6, the indirect n-cube network, seems well suited for applications
with uniform communication and has a cost of the same order as the lower bound derived in the previous 
section. 



In this introduction, we describe the indirect n-cube network, we discuss the relationship between
previously studied networks and the indirect n-cube network, and we give a brief overview of our work
on the indirect n-cube network.

An indirect n-cube network is constructed as shown in Figure 6. An N-input network is composed
from two N/2-input networks and N/2 nodes (called routers). Each node has two input and two output
links. This construction yields an interconnection which is topologically equivalent to the interconnection
of butterflies in the radix two fast Fourier transform [11]. In total (N/2) log_2 N nodes are required. One
and only one path exists from a given input to a given output. If network outputs, stages, and node
outputs are numbered as shown in Figure 7, then at the i-th stage the appropriate path follows the node
output that corresponds to the i-th most significant bit of the binary representation of the number of the
destination output.
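
The node count and the bit-by-bit routing rule can be illustrated with a short Python sketch. The function names are hypothetical, and the numbering is an assumption, since Figure 7 is not reproduced here: stages are numbered from 1 at the input side, outputs from 0 to N-1, and node output 0 or 1 is taken to match the value of the corresponding destination bit.

```python
from typing import List


def stage_count(num_inputs: int) -> int:
    """Return log2(N) for a network with N inputs, where N is a power of two."""
    if num_inputs < 2 or num_inputs & (num_inputs - 1):
        raise ValueError("num_inputs must be a power of two and at least 2")
    return num_inputs.bit_length() - 1


def node_count(num_inputs: int) -> int:
    """Number of two-input, two-output nodes: (N/2) * log2(N)."""
    return (num_inputs // 2) * stage_count(num_inputs)


def output_ports(destination: int, num_inputs: int) -> List[int]:
    """Node output taken at stages 1..log2(N): the i-th most significant
    bit of the destination number selects the output at stage i."""
    stages = stage_count(num_inputs)
    if not 0 <= destination < num_inputs:
        raise ValueError("destination must be in range(num_inputs)")
    return [(destination >> (stages - i)) & 1 for i in range(1, stages + 1)]
```

For an 8-input network this gives 12 nodes, and a packet tagged for output 5 (binary 101) takes node outputs 1, 0, 1 at the three stages, whichever input it entered on.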

Fig. 6. NxN Indirect n-Cube Network Construction 

Each node of the network can be structured as shown in Figure 8. The node has a fifo buffer 
capable of storing some number of packets on each of its input links. If at the beginning of a time unit a 
buffer is not full and the corresponding input ready line is asserted then the node control places the 
packet available on the data lines in the buffer and asserts the returning acknowledge line before the end of 
the time unit. If a buffer is not empty at the beginning of a time unit then the node control attempts to 
place the packet which entered the buffer first on the output link corresponding to its destination. If the 
node control can do this, either because no conflict exists or because of arbitration of the conflict, then it 
asserts the corresponding output ready line. If an acknowledge is returned on that link before the end of 
the time unit, then the node control removes the ready and removes the packet from the buffer. 
Otherwise, ready and the packet are removed during the first time unit when acknowledge is returned. 
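
The ready/acknowledge behavior of a node can be illustrated with a small Python sketch. It is a simplification rather than the node design of Figure 8: the text does not specify the arbitration rule, so alternating priority between the two inputs is assumed here, a packet is represented only by its destination number, and the names `Router`, `offer`, and `forward` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class Router:
    """Two-input, two-output node with one fifo buffer per input link."""

    def __init__(self, buffer_size: int, tag_bit: int) -> None:
        self.buffer_size = buffer_size
        self.tag_bit = tag_bit      # destination bit examined by this node
        self.buffers: List[Deque[int]] = [deque(), deque()]
        self.favoured = 0           # assumed arbitration: alternate priority

    def offer(self, link: int, destination: int) -> bool:
        """Sender asserts ready on an input; True means acknowledge."""
        if len(self.buffers[link]) >= self.buffer_size:
            return False            # buffer full: no acknowledge this time unit
        self.buffers[link].append(destination)
        return True

    def forward(self, acknowledged: List[bool]) -> Dict[int, int]:
        """Handle the outputs for one time unit.

        acknowledged[o] tells whether the receiver on output o returns
        an acknowledge. Returns {output link: packet removed}.
        """
        sent: Dict[int, int] = {}
        claimed = set()
        for link in (self.favoured, 1 - self.favoured):
            buf = self.buffers[link]
            if not buf:
                continue
            output = (buf[0] >> self.tag_bit) & 1
            if output in claimed:
                continue            # lost arbitration: head packet waits
            claimed.add(output)     # ready is asserted on this output
            if acknowledged[output]:
                sent[output] = buf.popleft()
        self.favoured = 1 - self.favoured
        return sent
```

A packet whose ready is not acknowledged simply stays at the head of its buffer and is offered again in the next time unit, which mirrors the last sentence of the description above.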

As was mentioned in the first chapter, there are a large number of networks that have structures that 
are related to that of the indirect n-cube network. The topology of the indirect n-cube routing network is 
identical to that of several other networks including omega, s = f = 2 banyan, and Pease's indirect n-cube 
permutation network [16, 10, 21]. Networks with the same topology have also been called delta networks 
[20]. The topology of the shuffle-exchange network [24] is also related since an omega network is simply a 
cascade of log2 N shuffle-exchange networks. As was mentioned in the first chapter, these related 
networks have been proposed for a variety of uses. Some of these networks have been proposed to 
perform permutations on large vectors of data. In general, these permutation networks have been 
proposed for systems in which the elements of a given vector are processed synchronously and then 
permuted. Other related networks have been proposed to interconnect the processors and memories of 
other types of multiple processor systems. 

The work on networks for the synchronous permutation of large vectors of data has been mostly 
concerned with the types of permutations that can be realized by a given number of passes through such a 
network, and thus that work does not directly address the question of how well such networks perform 
when interconnecting the modules of a packet communication system. 

Some of the work on interconnection networks for other types of multiple processor systems is 
more closely related to our study of the indirect n-cube network. We discuss a few pieces of this work. 
The first is the work of Valiant [28]. Valiant has suggested the use of networks such as the packet 
switched n-cube for interconnection of processors and memories in a synchronous multiprocessor system. 
Valiant introduces the concept of an idealistic computer composed of processors that operate 
synchronously and a memory that the processors share. He considers algorithms such that no memory 
location is accessed in a given computational step by more than some constant number of processors. He 
assumes that the idealistic computer can implement each computational step in a single unit of time. He 
considers the simulation of the idealistic computer on a realistic computer with a packet switched n-cube 
network. Each node of the network has the capacity to buffer a number of packets proportional to the log 
of the number of processors. Valiant shows that with high probability the realistic computer can 
implement a computational step in time proportional to the log of the number of processors. While it 
seems plausible that the memory accesses corresponding to several computational steps can be pipelined 
in the realistic computer, Valiant does not show this. Thus, Valiant's work differs from ours in at least 
three ways. First, he considers systems of synchronous processors and we consider systems of largely 
independent asynchronous processors. Second, in each network node he allows buffering proportional to 
the log of the number of processors and we allow only buffering of fixed size. Finally, he does not 
consider the pipelining of packets of different computational steps through his network and we consider a 
continual flow of packets through our networks. 

Upfal [27] has shown similar results for networks of fixed degree. He uses the d-way digit-exchange 
graph. A processor is associated with each network node and each network node is assumed to have 
O(log N) buffers where N is the number of processors. It is assumed that a packet is initially at each 
processor and that each packet is destined for some other processor. No two packets are destined for the 
same processor. Upfal shows that with high probability all the packets can be delivered to their 
destinations in O(log N) time. 

Patel has suggested the use of circuit switched delta networks for multiprocessors in which each 
processor makes independent and random memory accesses [20]. For his analysis, Patel assumes that 
memory requests in a multiprocessor are generated in a manner similar to the manner in which packets 
are generated by our model sources. The primary distinction between our work and that of Patel is the 
fact that our routing networks are packet switched. In Patel's circuit switched network, the transmission 
of a message requires the use of a circuit through all of the stages of the network. In our routing 
networks, at any given time a packet only uses one link to go from its present stage to the next stage. The 
throughput of Patel's networks does not grow linearly with their size. As we shall see, there is reason to 
believe that the throughput of the indirect n-cube network for uniform patterns of communication does 
grow linearly with its size. 

Dias and Jump [8] have done work on the use of packet switched delta networks as interconnection 
networks for multiprocessors and this work is very closely related to our study of the indirect n-cube 
network. Their network is topologically identical to the indirect n-cube. Their analysis assumes that the 
packets and the labels on the packets are generated in a manner similar to the way that packets and packet 
labels are generated by our model sources. They analyze their networks using network models in a 
manner similar to the way that we analyze the indirect n-cube network using network models. However, 
their models differ from ours. They use a Markov model to develop approximate equations for the state 
probabilities of a router in a given stage in terms of the state probabilities of the routers connected to it. 
They simultaneously solve the equations for all the stages. Their analysis makes several approximations. 
The analysis assumes that the routers of a given stage are independent. Also, the analysis of a given 
router assumes that the state probabilities of routers connected to it are independent of the state and 
history of the given router. Some of the characteristics of network behavior that we study in our models 
violate these assumptions of independence. For modest sized networks, the network throughput 
predicted by our models is consistent with that predicted by their models. However, for very large 
networks our throughput predictions differ from theirs. Since their model assumes more independence 
than ours one might expect it to predict higher throughput, but in fact the way that their assumptions are 
used in their model leads to a prediction of lower throughput. Their model predicts that the normalized 
throughput goes to zero as the network size goes to infinity [9] where normalized throughput is defined to 
be network throughput divided by network size. All of the constraints on network throughput that we 
study allow a non zero asymptote. We believe that our study considers all of the constraints represented 
by their model and several that are not. We believe that the asymptotic prediction of their model is the 
result of the way that their assumptions are used in their model and does not reflect any real constraint on 
the throughput of the network. Another difference between their work and ours comes from the fact that 
in addition to studying the average network throughput we also consider the speed of slow network 
inputs. Their work is primarily concerned with the average network throughput and delay. 

Recently, Pippenger [22] has extended the results of Valiant and Upfal to a network with a fixed 
amount of buffering at each node. In his work, Pippenger uses the d-way digit-exchange graph. A 
processor is associated with each network node. Only a fixed amount of buffering is assumed at each 
node. It is assumed that a packet is initially at each processor and that each packet is destined for some 
other processor. No two packets are destined for the same processor. Pippenger assumes that each node 
obeys certain rules concerning the order in which it processes the packets that it receives, and shows that 
if the rules are obeyed then with high probability all the packets can be delivered to their destinations in 
O(log N) time. 

While there are differences between Pippenger's work and ours, Pippenger's results are significant 
and have some bearing on our work. Pippenger's network differs from the indirect n-cube network; the 
indirect n-cube network would be more closely related to Pippenger's network if the inputs of the indirect 
n-cube network were connected to the outputs and if a processor were associated with each network node 
of the indirect n-cube network. The type of network operation that Pippenger considers differs from the 
type that we consider; Pippenger does not consider the pipelining of waves of packets through his 
network and we consider a continual flow of packets through the indirect n-cube network. However, by 
establishing certain additional rules for the operation of the nodes of the indirect n-cube it may be 
possible to extend Pippenger's approach to provide results on the performance of the indirect n-cube 
network for uniform communication. The additional rules would concern the order in which a network 
node processes the packets that it receives, and possibly the removal of unusual blockages. It may be 
possible to show that the normalized throughput of the indirect n-cube network with the additional rules 
approaches a non zero constant. Such a result is plausible, but even if the result holds it may be very 
difficult to prove. In any case, such a result is consistent with our work. Our work considers the 
performance of the indirect n-cube without rules such as those mentioned above. Even without such 
rules, the strongest constraint on network throughput that we find still allows normalized throughput to 
approach a non zero constant. 

In the following pages, we examine the effect of certain important characteristics of the operation of 
very large indirect n-cube networks. In particular, we study in 2.2.3.2 the effect of congestion at a single 
router. In 2.2.3.3, we study the effect of congestion in a single stage of routers. In 2.2.3.4, we study the 
effect of the interaction of routers of different stages. As was mentioned earlier, we examine the effect of 
these characteristics of network behavior by using network models that accurately model these 
characteristics and that are considerably simpler than the actual network. By analyzing and simulating 
the models, we examine the effect of these characteristics on network throughput and the speed of slow 
network inputs. As was mentioned earlier, our study suggests that very large indirect n-cube networks 
can support high performance for uniform communication patterns. The strongest constraint on network 
throughput that we find in our study is caused by the interaction of routers of different stages and it still 
allows throughput to grow linearly with network size. However, our study of the interaction of routers of 
different stages also indicates that some of the inputs of a very large network can be slow for a very long 
period of time. 

2.2.3.2 Tree Buffering 

The first characteristic of network operation that we examine is tree buffering. We use the term tree 
buffering to refer to the buffering of packets that occurs in front of a congested router. Such buffering 
involves a tree structure of buffers. As a result, congestion at one router in a given stage can affect a large 
number of other routers. 

In this section, we examine tree buffering in front of a single congested router. In the next section, 
we examine a stage of routers and we examine the probability that at least one of the routers of the stage 
has a deep tree buffered in front of it. 

In this section, we use a network model to examine how much tree buffering can occur in front of a 
congested router. In the model, congestion can occur only at one router. We will study the model in 
order to determine the amount of buffering that occurs as a function of the amount of congestion and the 
overall network input rate. 

As we shall see, the model suggests that deep tree buffering in front of a slow router occurs if the 
rate at which the router can accept packets is close to the rate at which packets that must go through that 
router are generated. In particular, the model suggests that if the rate at which packets are generated on 
each network input is IN and if the rate at which the router can accept packets on its input is OUT then 
the expected number of packets buffered in front of that input is greater than 

$$2^{\left(\frac{IN}{2(OUT-IN)(B+1)} - 2\right)} - 1,$$

where B is the size of the buffers. It should be noted that some aspects of 
this expression are intuitive. The IN/(2(OUT-IN)) factor in the exponent is similar to the 
expected occupancy of certain types of queues in classical queueing theory, and reflects the queueing of 
packets that must be passed through the congested router. The exponential growth in the total number of 
packets buffered, most of which do not have to pass through the congested router, comes from the tree 
structure of routers involved. The 1/(B+1) factor in the exponent reflects the influence of the size of the 
input buffers of the routers. 

The model that we use is shown in Figure 9. The model is composed of a tree of routers as shown 
in the figure. The depth of the tree is a parameter of the model. The first output of the router at the root 
of the tree is connected to a probabilistic packet sink and the second output is connected to a perfect 
packet sink. All other routers have their first output connected to a router in the following stage and have 
their second output connected to a perfect packet sink as shown in the diagram. Each input of each 
router at a leaf of the tree is connected to a probabilistic packet source. 

Fig. 9. Tree Buffering Model 

The tree of routers of the model represents the tree of routers in front of a congested router in the 
network. The probabilistic sink represents the congested router. The perfect sinks represent the routers 
directly connected in the network to the tree of routers being studied. 



The routers in this model, unlike the routers of the network, operate instantly. Thus, a packet will 
ripple through the model in one time unit. It will either be output at a perfect sink or it will run into 
other packets buffered as a result of congestion at the probabilistic sink. This assumption allows us to 
easily study buffering due to congestion at the probabilistic sink since it is the only type of buffering that 
occurs. 



The fifo buffers in the routers all have the same size B. B is a parameter of the model. 

The probabilistic packet sink contains a fifo buffer whose size is the same as that of the buffers in 
the routers. Packets input to the sink are placed in the fifo buffer. If at the beginning of a time unit the 
fifo is not empty then with probability OUT the sink removes one packet from the fifo buffer. OUT is a 
parameter of the model. 

The perfect sinks never block and accept packets at whatever rate they are presented. 

The probabilistic packet sources produce packets. If the input buffer connected to a probabilistic 
source is not full at the beginning of a time unit, then with probability IN the probabilistic source places 
an additional packet in the buffer. IN is a parameter of the model. The tag for each packet has as many 
bits as the depth of the network. Each bit of each tag is independently and randomly selected with one 
and zero being equally likely. 

We can obtain a rough understanding of the operation of the model without much effort by 
considering the packets that are buffered in the model and that are tagged for the probabilistic sink, and 
by examining the expected number of such packets as a function of IN and OUT. For the purpose of 
discussion, we call these bps (buffered probabilistic sink) packets. 

bps packets can only leave the model at the probabilistic sink. From the operation of the model we 
can conclude that if at least one bps packet exists then the buffer in the probabilistic sink must contain at 
least one packet. Thus, in a given unit of time if any bps packets exist then with probability OUT one will 
be consumed by the probabilistic sink. 

A bps packet can enter the network from any one of the probabilistic sources. If the depth of the 
model is d then the number of sources is N where N is equal to $2^d$. The chance that an unblocked source 
produces a bps packet in a given unit of time is IN/N. If N is large and if none of the sources is blocked 
then the chance that k bps packets are produced in a given unit of time may be approximated by 
$IN^k e^{-IN} / k!$. The accuracy of this approximation increases with the size of N and is exact for infinite N. 
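
The quality of this approximation is easy to check numerically. The sketch below assumes that each of the N unblocked sources independently produces a bps packet with probability IN/N, so that the exact count is binomial, and compares it with the Poisson probability with mean IN. The function names are hypothetical.

```python
from math import comb, exp, factorial


def exact_probability(k: int, num_sources: int, rate_in: float) -> float:
    """Binomial chance that exactly k of the N sources produce a bps packet."""
    if k > num_sources:
        return 0.0
    p = rate_in / num_sources
    return comb(num_sources, k) * p ** k * (1.0 - p) ** (num_sources - k)


def poisson_probability(k: int, rate_in: float) -> float:
    """Poisson approximation with mean IN."""
    return rate_in ** k * exp(-rate_in) / factorial(k)


def worst_error(num_sources: int, rate_in: float, max_k: int = 6) -> float:
    """Largest absolute difference between the two over k = 0..max_k."""
    return max(
        abs(exact_probability(k, num_sources, rate_in)
            - poisson_probability(k, rate_in))
        for k in range(max_k + 1)
    )
```

Calling `worst_error(4, 0.5)` and `worst_error(2**16, 0.5)` shows the error shrinking as the number of sources grows, in line with the statement above.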

In the following paragraphs we will examine for a very large network model the expected number 
of bps packets as a function of IN and OUT. We will assume that OUT > IN. We will assume that the 
network is large enough that the chance of a blocked source is small and can be ignored. We will assume 
that the network starts at time zero with no bps packets. 



Proposition. The expected value of the limiting distribution for the number of bps packets is equal to 

$$\frac{2IN - IN^2}{2(OUT - IN)}. \tag{2}$$



Proof. We find the average number of bps packets using an approach similar to that used by Kleinrock 
for the M/G/1 queue [12]. For the purpose of discussion, we use the notation $q_n$ to represent the 
number of bps packets at time n. We use $\Delta_{n+1}$ to represent the number of bps packets served between 
n and n+1. $\Delta_{n+1}$ is of course equal to either zero or one. We use $v_{n+1}$ to represent the number of bps 
packets generated between n and n+1. 

From these definitions it follows that $q_{n+1}$, the number of bps packets at time n+1, is given by the 
equation 

$$q_{n+1} = q_n - \Delta_{n+1} + v_{n+1}. \tag{3}$$

If we square both sides we get 

$$q_{n+1}^2 = q_n^2 + \Delta_{n+1}^2 + v_{n+1}^2 - 2 q_n \Delta_{n+1} + 2 q_n v_{n+1} - 2 \Delta_{n+1} v_{n+1}. \tag{4}$$

Let us form the expectation of both sides. We use the notation E[x] to represent the expected value of 
x. Also we make use of the fact that $\Delta_{n+1}^2$, the square of the number of bps packets served between n 
and n+1, is either one or zero and is equal to $\Delta_{n+1}$. $E[q_{n+1}^2]$, the expected value of the square of the 
number of bps packets at time n+1, is given by the equation 

$$E[q_{n+1}^2] = E[q_n^2] + E[\Delta_{n+1}] + E[v_{n+1}^2] - 2E[q_n \Delta_{n+1}] + 2E[q_n v_{n+1}] - 2E[\Delta_{n+1} v_{n+1}]. \tag{5}$$

Since $v_{n+1}$, the number of bps packets generated between n and n+1, is independent of $q_n$ and $\Delta_{n+1}$, 

$$E[q_{n+1}^2] = E[q_n^2] + E[\Delta_{n+1}] + E[v_{n+1}^2] - 2E[q_n \Delta_{n+1}] + 2E[q_n]E[v_{n+1}] - 2E[\Delta_{n+1}]E[v_{n+1}]. \tag{6}$$

We are interested in the limit as n goes to infinity. We are interested in the limiting distribution for 
the random variable $q_n$, the number of bps packets at time n. We denote the limiting distribution by $\tilde{q}$. 
Similarly we use the limiting distribution for the random variable $v_{n+1}$, the number of bps packets 
generated between n and n+1. We denote the limiting distribution by $\tilde{v}$. We assume that the jth 
moment of $q_n$ exists in the limit as n goes to infinity independent of n, namely, 

$$\lim_{n \to \infty} E[q_n^j] = E[\tilde{q}^j]. \tag{7}$$

We make a similar assumption about the jth moment of $v_n$. We make use of the fact that 
$\lim_{n \to \infty} E[\Delta_{n+1}]$ must equal the average input rate, IN. Thus, 

$$E[\tilde{q}^2] = E[\tilde{q}^2] + IN + E[\tilde{v}^2] + 2E[\tilde{q}]E[\tilde{v}] - 2(IN)E[\tilde{v}] - \lim_{n \to \infty} 2E[q_n \Delta_{n+1}]. \tag{8}$$

The probability that $\Delta_{n+1} = 1$ given that $q_n > 0$ is OUT. Thus, 

$$\lim_{n \to \infty} E[q_n \Delta_{n+1}] = (OUT)E[\tilde{q}] \tag{9}$$

and 

$$0 = IN + E[\tilde{v}^2] + 2E[\tilde{q}]E[\tilde{v}] - 2(IN)E[\tilde{v}] - 2(OUT)E[\tilde{q}]. \tag{10}$$

The probability that $\tilde{v} = k$ is $IN^k e^{-IN} / k!$. This of course is the Poisson distribution. Thus, 

$$E[\tilde{v}] = IN, \tag{11}$$

$$E[\tilde{v}^2] = IN^2 + IN, \tag{12}$$

and 

$$0 = IN + IN^2 + IN + 2(IN)E[\tilde{q}] - 2IN^2 - 2(OUT)E[\tilde{q}] \tag{13}$$

or 

$$2IN - IN^2 = 2(OUT - IN)E[\tilde{q}]. \tag{14}$$

Therefore, $E[\tilde{q}]$, the expected value of the limiting distribution for the number of bps packets, is equal to 

$$\frac{2IN - IN^2}{2(OUT - IN)}. \tag{15}$$

This ends the discussion of the proposition. 

Thus, the expected number of bps packets is very large if and only if OUT is very close to IN. For 
example, if OUT - IN is equal to 1/a for some constant a then the expected number of bps packets is less 
than a IN. 
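
The proposition can be checked with a short Monte Carlo sketch of the recurrence used in the proof. The sketch makes the same simplifying assumptions as the analysis (Poisson arrivals with mean IN and no blocked sources); the function names are hypothetical, and the Poisson sampler is the standard multiplication method, chosen here only because the Python standard library has no built-in Poisson generator.

```python
import math
import random


def expected_bps(rate_in: float, rate_out: float) -> float:
    """Closed form (15): (2 IN - IN^2) / (2 (OUT - IN)), for OUT > IN."""
    if not rate_out > rate_in:
        raise ValueError("the formula assumes OUT > IN")
    return (2 * rate_in - rate_in ** 2) / (2 * (rate_out - rate_in))


def sample_poisson(rng: random.Random, mean: float) -> int:
    """Multiplication method; adequate for the small means used here."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_bps(rate_in: float, rate_out: float, steps: int, seed: int = 0) -> float:
    """Time average of q_n under q_{n+1} = q_n - served + generated."""
    rng = random.Random(seed)
    q = 0
    total = 0
    for _ in range(steps):
        # One packet is served with probability OUT when any bps packet exists.
        served = 1 if q > 0 and rng.random() < rate_out else 0
        q = q - served + sample_poisson(rng, rate_in)
        total += q
    return total / steps
```

With IN = 0.5 and OUT = 0.75 the closed form gives 1.5, and a run such as `simulate_bps(0.5, 0.75, 200000)` should settle close to that value.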

Other measures of the amount of buffering in the network model can be deduced from this result 
for bps packets. 

For the purpose of discussion, we define some notation. We use the notation $p_n$ to denote the total 
number of packets buffered at time n. We use $\tilde{p}$ to represent the limiting distribution for the random 
variable $p_n$. We define $f_1(i)$ as follows. 

$$f_1(i) = \begin{cases} 0, & \text{if } i = 0, \\ \log_2 i, & \text{if } i > 0. \end{cases} \tag{16}$$

We use $l_n$ to represent $f_1(p_n)$. We use $\tilde{l}$ to represent the limiting distribution for the random variable 
$l_n$. 

Proposition. The expected value of $\tilde{p}$, the limiting distribution for the total number of packets 
buffered, is greater than 

$$2^{\left(\frac{IN}{2(OUT-IN)(B+1)} - 2\right)} - 1. \tag{17}$$



Proof. The desired result can be obtained from a lower bound on the expected value of $\tilde{l}$, the limiting 
distribution for $f_1(p_n)$. And $E[\tilde{l}]$ in turn can be obtained from the expected value of $\tilde{q}$, the limiting 
distribution for the number of bps packets. 

$l_n$, that is $f_1(p_n)$, can be related to $q_n$, the number of bps packets at time n. We use P[x = i] to 
represent the probability that x equals i and P[x = i | y = j] to represent the probability that x equals i 
given that y equals j. Clearly, $E[q_n]$, the expected number of bps packets at time n, is equal to 

$$\sum_{i \ge 0} \sum_{j \ge 0} j \, P[q_n = j \mid l_n = f_1(i)] \, P[l_n = f_1(i)]. \tag{18}$$

We use the notation E[x | y = i] to represent the expected value of x given that y equals i. Thus, 

$$E[q_n] = \sum_{i \ge 0} E[q_n \mid l_n = f_1(i)] \, P[l_n = f_1(i)]. \tag{19}$$

An upper bound on $E[q_n \mid l_n = f_1(i)]$, the expected number of bps packets at time n given that the 
total number of packets buffered at time n is equal to i, can be obtained by examining the model in more 
detail. 

Packets buffered in stages close to the probabilistic sink are more likely to be bps packets than 
packets buffered in distant stages. Consider Figure 10. For the purpose of discussion we number the 
stages of the routers in the model as shown. All the packets in the fifo buffer of the probabilistic sink 
must be bps packets. If a fifo of the router in stage one contains one or more packets then the packet at 
the output of that fifo must be a bps packet. If that fifo contains more than one packet, any packet that is 
not at the output may be a bps packet. The probability that such a packet is a bps packet is 1/2. Similar 
statements can be made about packets buffered in higher stages. The probability that a packet at the 
output of a fifo in stage k is a bps packet is $1/2^{(k-1)}$. If that fifo contains other packets, the probability 
that such a packet is a bps packet is $1/2^k$. 



Thus, we can get an upper bound on $E[q_n \mid l_n = f_1(i)]$ by assuming that the buffered packets are 
packed in the lower stages. We assume that at time n all of the i buffered packets are in the $f_2(i) + 1$ 
lowest stages where $f_2(i)$ is the smallest non negative integer such that the capacity of the lowest 
$f_2(i) + 1$ stages is greater than or equal to i. The $f_2(i) + 1$ lowest stages include stage 0 through 
stage $f_2(i)$. Notice that 

$$\sum_{k=0}^{f_2(i)-1} B 2^k < i \le \sum_{k=0}^{f_2(i)} B 2^k \quad \text{for } i > 0 \tag{20}$$

and 

$$B\left(2^{f_2(i)} - 1\right) < i \le B\left(2^{f_2(i)+1} - 1\right) \quad \text{for } i > 0. \tag{21}$$

As a result, 

$$f_2(i) < \log_2(i/B + 1), \quad \text{for } i > 0. \tag{22}$$
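
The step from (20) to (22) is a geometric sum followed by a logarithm. A brief worked sketch may help; the shorthand m for $f_2(i)$ is introduced here only for readability.

```latex
\begin{align*}
\sum_{k=0}^{m} B\,2^{k} &= B\left(2^{m+1}-1\right), \\
B\left(2^{m}-1\right) < i
  &\;\Longrightarrow\; 2^{m} < \frac{i}{B}+1
  \;\Longrightarrow\; m < \log_2\!\left(\frac{i}{B}+1\right).
\end{align*}
```

As an example, with B = 2 and i = 7 the cumulative capacities of stages 0, 0 to 1, and 0 to 2 are 2, 6, and 14, so $f_2(7) = 2$, which is indeed less than $\log_2(7/2 + 1) \approx 2.17$.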

Given the discussion of the previous paragraph, the definition of $f_2(i)$, and the fact that the kth stage 
contains $2^k$ fifos, $E[q_n \mid l_n = f_1(i)]$, the expected number of bps packets at time n given that the total 
number of packets buffered at time n is equal to i, is less than or equal to 

$$B + \sum_{k=1}^{f_2(i)} 2^k \left( \frac{1}{2^{k-1}} + \frac{B-1}{2^k} \right) \tag{23}$$

or 

$$E[q_n \mid l_n = f_1(i)] \le B + f_2(i)(B + 1). \tag{24}$$

Given the relation (22), we can conclude that 

$$E[q_n \mid l_n = f_1(i)] < B + \left(\log_2(i/B + 1)\right)(B + 1). \tag{25}$$



This implies a relationship between $E[q_n]$, the expected number of bps packets at time n, and 
$E[l_n]$, the expected value of $f_1$(the total number of packets buffered at time n). If we substitute (25) into 
(19), we get 

$$E[q_n] < \sum_{i \ge 0} \left(B + (\log_2(i/B + 1))(B + 1)\right) P[l_n = f_1(i)]. \tag{26}$$

Thus, 

$$E[q_n] < (B + 0(B + 1)) P[l_n = f_1(0)] + \sum_{i > 0} \left(B + (\log_2(i/B + 1))(B + 1)\right) P[l_n = f_1(i)]. \tag{27}$$

Since $\log_2(i/B + 1)$ is equal to $\log_2(i(1/B + 1/i))$ and is thus equal to $(\log_2 i) + \log_2(1/B + 1/i)$, 

$$E[q_n] < (B + 0(B + 1)) P[l_n = f_1(0)] + \sum_{i > 0} \left(B + ((\log_2 i) + \log_2(1/B + 1/i))(B + 1)\right) P[l_n = f_1(i)]. \tag{28}$$

By (16), the definition of $f_1$, 

$$E[q_n] < \sum_{i \ge 0} \left(B + (B + 1) f_1(i) + (B + 1)\right) P[l_n = f_1(i)]. \tag{29}$$

Thus, 

$$E[q_n] < (B + 1) E[l_n] + 2B + 1. \tag{30}$$

We assume that $E[l_n]$ also exists in the limit as n goes to infinity. Thus taking the limit, 

$$E[\tilde{l}] \ge \frac{E[\tilde{q}]}{B + 1} - \frac{2B + 1}{B + 1}. \tag{31}$$

Since $2IN - IN^2 > IN$, (15) and (31) imply that $E[\tilde{l}]$, the expected value of the limiting distribution 
for $f_1(p_n)$, that is $f_1$(the total number of packets buffered at time n), is greater than 

$$\frac{IN}{2(OUT - IN)(B + 1)} - 2. \tag{32}$$



This can be used to obtain a lower bound on $E[\tilde{p}]$, the expected value of the limiting distribution 
for the total number of packets buffered. Since $p_n \ge 2^{l_n} - 1$, 

$$E[\tilde{p}] \ge E[2^{\tilde{l}}] - 1 \tag{33}$$

where $\tilde{p}$ is the limiting distribution for $p_n$ and $p_n$ is the total number of packets buffered at time n. 
Since exponentiation is a convex function, (33) implies that 

$$E[\tilde{p}] \ge 2^{E[\tilde{l}]} - 1. \tag{34}$$

Thus, from (32) we can conclude that $E[\tilde{p}]$, the expected value of the limiting distribution for the total 
number of packets buffered, is greater than 

$$2^{\left(\frac{IN}{2(OUT-IN)(B+1)} - 2\right)} - 1. \tag{35}$$

This ends the discussion of the proposition. 

Thus, the results of the model suggest that deep tree buffering will occur in front of a slow router if 
the rate at which the router can accept packets is close to the rate at which packets that must go through 
that router are generated. If the rate at which packets are generated on each of the network inputs is IN 
and if the rate at which the router can accept packets on that input is OUT then the model suggests that 
the expected value of $f_1$(the number of packets buffered in front of that input) is greater than 
$\frac{IN}{2(OUT-IN)(B+1)} - 2$. Similarly, the model suggests that the expected number of packets buffered in 
front of that input is greater than $2^{\left(\frac{IN}{2(OUT-IN)(B+1)} - 2\right)} - 1$. In the next section, we examine a stage of 
routers and we examine the probability that at least one of the routers of the stage has a deep tree 
buffered in front of it. 
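
To get a feel for how quickly these bounds grow as OUT approaches IN, the two expressions can be evaluated directly. The sketch below only transcribes bounds (32) and (35); the function names are hypothetical and the sample parameter values are chosen purely for illustration.

```python
def log_depth_lower_bound(rate_in: float, rate_out: float, buffer_size: int) -> float:
    """Bound (32): IN / (2 (OUT - IN)(B + 1)) - 2, valid for OUT > IN."""
    if not rate_out > rate_in:
        raise ValueError("the bound assumes OUT > IN")
    return rate_in / (2 * (rate_out - rate_in) * (buffer_size + 1)) - 2


def buffered_packets_lower_bound(rate_in: float, rate_out: float, buffer_size: int) -> float:
    """Bound (35): 2 raised to bound (32), minus 1."""
    return 2 ** log_depth_lower_bound(rate_in, rate_out, buffer_size) - 1
```

For IN = 0.5, OUT = 0.51, and B = 1 the exponent is 10.5, so the bound on the expected number of buffered packets exceeds 1400. For IN = 0.5, OUT = 0.75, and B = 1 the exponent is -1.5 and the bound is negative, hence uninformative; the bounds say something only when OUT is close to IN.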

2.2.3.3 Effect of the Last Stage 

In this section, we examine the effect that congestion in routers of the last stage of an indirect 
n-cube has on the network's operation. As we have seen in the previous section, congestion in a router of 
a given stage can easily affect many routers of an earlier stage. We now consider congestion in all the 
routers of a given stage. We use the last stage because analysis of the last stage is somewhat easier than 
analysis of other stages. There is a path from each network input to each router of the last stage. Thus, 
congestion in any router of the last stage may affect all of the network inputs. 

We use a network model to study the effect of the last stage. This model represents an indirect 
n-cube network with its outputs connected to perfect packet sinks. The model considers only buffering 
caused by congestion in routers of the last stage of the network. We use the model to study the limit that 
such buffering places on the throughput of an indirect n-cube network. 

Rather than analyze the model directly, we choose to transform the model, analyze the transformed 
model, and use the results to draw conclusions about the original model. 

Based on the results of the model, we conclude that the effect of the last stage of routers in a 
network does not place a severe constraint on the throughput of the network. 






Model of the Effect of the Last Stage 

In the following paragraphs, we discuss the model: we describe the model, discuss the relationship 
between the model and the indirect n-cube network, discuss the characteristics of network buffering that 
will be studied using the model, and describe the details of the components of the model. 

The model is composed of nodes, probabilistic sinks, and a deterministic source. The model is 
constructed in the manner shown in Figure 11. The model for a 2-input network is constructed from a 
deterministic source and a probabilistic sink. The model for an N-input network is constructed from a 
deterministic source, N/2 probabilistic sinks, and a tree of nodes with N/2 leaves. A tree with 2 leaves is 
one node. A tree with N leaves is constructed from two trees with N/2 leaves. 

The model reflects primarily two features of an indirect n-cube network: the probabilistic input rate 
of routers of the last stage, and the buffering capacity between each router of the last stage and the 
network inputs. The probabilistic sinks of the model represent the routers of the last stage of the network. 
The nodes of the model represent the routers of the other stages of the network. The nodes of the model 
are connected in a tree structure as shown in Figure 12. If the stages of the model are numbered from the 
root to the probabilistic sinks and if the total number of stages is d then each packet in stage i of the 
model represents $2^{d+1-i}$ packets in stage i of the network. Each node of the model has an input buffer 
of size B where B is the size of the buffers in the network. Thus, the buffering capacity of the model 
between a probabilistic sink and the model input represents the buffering capacity of the network 
between a router of the last stage and the network inputs. 

We use the model to study the buffering of packets in an indirect n-cube network caused by routers 
of the last stage. The nodes of the model do not operate in the same manner as the routers of the 
network. Each node of the model evenly splits between its two outputs the flow of packets from its input. 
A node will instantly process a packet in its input buffer unless one of the buffers connected to its outputs 
is full. Thus, the buffering of packets in the model can be caused only by the probabilistic sinks and 
represents the buffering in the network caused by the routers of the last stage. 

Fig. 11. Model of the Effect of the Last Stage 



We also use the model to examine the limit that conflict in the last stage of network routers places 
on network performance. The model accurately reflects the buffering capacity of the network between a 
last stage router and the network inputs. Thus, the maximum input rate that can be supported in the 
model without source blocking is an indication of the limit that conflict in the last stage of network 
routers places on network performance. 

Fig. 12. Nodes of the Model of the Effect of the Last Stage 

The probabilistic sinks of the model are similar to the probabilistic sinks used in the previous 
section. Each sink contains a buffer of size B where B is the size of the buffers in the network. If at the 
beginning of a time unit the buffer is not empty then with probability .75 the sink removes one packet 
from the buffer. The average rate at which the sink can remove packets from its buffer corresponds to 
half the average rate at which a router of the last stage of the network can accept packets since each packet 
removed by the sink represents two packets of the network. It should be noted that the probabilistic sinks 
are a pessimistic model of the routers of the last stage. The probabilistic sinks of the model can fail to 
accept a packet for an arbitrarily long period of time. The routers of the last stage of the network are 
capable of accepting at least one packet during each time unit. We use this pessimistic model because it 
makes the discussion simpler and because even with such a model, our analysis suggests that congestion 
in the last stage of routers does not place a severe constraint on the throughput of the network. 

Each node of the model, in effect, evenly splits between its two outputs the flow of packets from its 
input. A node is assumed to operate instantly. If a node's input buffer is not empty and if both buffers 
connected to the outputs of the node are not full then the node removes one packet from its input buffer 
and places a copy of the packet in each of the two buffers connected to its outputs. 

The deterministic source generates packets at a constant rate. The deterministic source generates 
each packet in the form of 100 subpackets. The source generates IN subpackets in each unit of time that 
it is not blocked. IN is a parameter. The source buffers the subpackets internally until the first unit of 
time in which it can output a whole packet. Thus, the source only outputs whole packets. 

Transformed Model 

Rather than analyze the model directly, we choose to transform the model and analyze the 
transformed model. We discuss the relationship between results for the transformed model and results 
for the original model. 

The transformations that we make are shown in Figure 13. 

The first transformation takes the buffers that were distributed in the tree of nodes and aggregates 
them at the probabilistic sinks. The probabilistic sinks in the transformed model contain buffers of size 
$(\log_2 N)B$ where N/2 is the number of probabilistic sinks and B is the size of the buffers in the original 
nodes. The deterministic source of the transformed model is directly connected to each of the 
probabilistic sinks. The deterministic source is blocked if any of the buffers in the probabilistic sinks are 
full. In each time unit that it is not blocked, the deterministic source simultaneously generates IN 
subpackets for each probabilistic sink. When the source has formed whole packets, it simultaneously 
outputs one packet to each of the probabilistic sinks. 

Fig. 13. Transformations on the Model of the Effect of the Last Stage 

The buffers in the transformed model have a greater independence than the buffers in the original 
model. In the transformed model, the buffering of packets caused by a particular probabilistic sink can 
affect the flow of packets into another probabilistic sink only by blocking the deterministic source. In the 
original model, the buffering of packets caused by a particular sink can affect other sinks by blocking 
nodes of the tree. 

It can be shown that the maximum input rate of the transformed model is at least as great as the 
maximum input rate of the original model. Thus, limits on the performance of the transformed model 
also apply to the original model. An upper bound on the performance of the transformed model implies 
an upper bound on the performance of the original model. 

The second transformation turns the buffer of size $(\log_2 N)B$ in each probabilistic sink into an 
infinite buffer in which we will look for occupancy of $(\log_2 N)B$ packets, and the second transformation 
also turns the single deterministic source into a large number of deterministic sources with one associated 
with each probabilistic sink. We assume that each of the new deterministic sources operates in a manner 
similar to that of the original deterministic source. Each source generates IN subpackets in each unit of 
time. The source buffers subpackets until it has produced more than 100 subpackets. The source outputs 
each group of 100 subpackets as a whole packet to its associated probabilistic sink. However, we assume 
that the state of each source-sink pair is independent of the state of the other source-sink pairs. We also 
assume that the sources are never blocked. 
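The behavior of a single source-sink pair after the second transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `busy_fraction`, the starting state, and the convention that the source hands over a packet as soon as 100 subpackets have accumulated are choices made here, not details fixed by the text. The state variable is the number of subpackets held at the source plus 100 times the number of packets in the sink buffer, so a state of at least 100 means the sink buffer is not empty.

```python
import random

def busy_fraction(IN, steps, seed=0):
    """Fraction of time units in which the sink buffer of one pair is non-empty.

    Assumed state: subpackets at the source + 100 * packets in the sink buffer.
    """
    rng = random.Random(seed)
    state = IN          # hypothetical starting point: one time unit of subpackets
    busy = 0
    for _ in range(steps):
        if state >= 100:                 # sink buffer holds at least one packet
            busy += 1
            if rng.random() < 0.75:      # sink removes one packet (100 subpackets)
                state -= 100
        state += IN                      # the unblocked source adds IN subpackets
    return busy / steps

if __name__ == "__main__":
    # Rate balance suggests the buffer is non-empty about IN/75 of the time.
    print(busy_fraction(60, 200_000), 60 / 75)
```

For input rates below 75 the measured fraction should settle near IN/75, since the sink removes packets at rate .75 only while its buffer is non-empty and the long-term removal rate must match the arrival rate of IN/100 packets per time unit.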

In the following paragraphs, we refer to the model after the first transformation as the model of the 
first transformation, and we refer to the model after both transformations as the divided model since the 
model is divided into independent source-sink pairs. 

Our analysis of the constraints on the input rate of the divided model also applies to the input rate 
of the original model. We examine the divided model to determine input rates that imply a high 
probability that at a randomly selected point in time at least one of the buffers contains at least $(\log_2 N)B$ 
packets. It seems reasonable to assume that such input rates can not be supported in the model of the 
first transformation since that model is similar to the divided model but has buffers of size $(\log_2 N)B$. 
Thus, it seems reasonable to assume that such input rates can not be supported in the original model since 
the maximum input rate of the original model is less than the maximum input rate of the model of the 
first transformation. 

However, as we shall see, our analysis of the divided model and simulation of the original model 
indicate that both models can support input rates close to the maximum rate at which a probabilistic sink 
can accept packets. Thus, our models suggest that conflict in a given stage of routers does not place a 
severe constraint on the overall throughput of the network. 

Probability of Buffering in the Divided Model 

In this section, we examine the Markov chain for the state of one of the source-sink pairs of the 
divided model and use the results to bound the probability in the divided model that at a randomly 
selected point in time the buffers in one or more of the sinks have at least $(\log_2 N)B$ packets. 

The Markov chain $MC_D$ for the state of one of the source-sink pairs of the divided model is shown 
in Figure 14. The state is equal to the number of subpackets being stored at the source plus 100 times the 
number of packets in the buffer of the sink since each packet is composed of 100 subpackets. For the 
purpose of discussion, we use the notation $p_D(i,j)$ to refer to the probability of a transition to state j 
given that the chain is in state i. A stationary distribution, $P_{DS}$, for the chain is a distribution such that 
$P_{DS}(i)$, the probability that the chain is in state i, is given by 

$$P_{DS}(i) = \sum_{j} p_D(j,i)\,P_{DS}(j). \tag{36}$$

We refer to (36) as the equilibrium equation for $P_{DS}(i)$. 

Fig. 14. $MC_D$ 

As is shown in the following subsection, any stationary distribution, $P_{DS}$, for $MC_D$ must be such 
that 

$$\left(1-\alpha^{\,j-100+\mathit{IN}}\right)(\mathit{IN}/75) \;\ge\; \sum_{i=100}^{j} P_{DS}(i) \quad \text{for } j \ge 100 \tag{37}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{DS}(i) \;\ge\; \left(1-\alpha^{\,j-99}\right)(\mathit{IN}/75) \quad \text{for } j \ge 100 \tag{38}$$

where $0 < \alpha < 1$ and $\alpha$ is the root in this range of the equation 

$$\alpha^{\mathit{IN}} = .25 + .75\,\alpha^{100}.$$
For the purpose of discussion, we define some notation. We use the notation $P_{gb}$ to represent the 
probability in the divided model that at a randomly selected point in time a given buffer has at least 
$(\log_2 N)B$ packets. We use the notation $P_b$ to represent the probability that at a randomly selected point in 
time one or more buffers have at least $(\log_2 N)B$ packets. 

The relations, (37) and (38), for one buffer can be used to obtain results for the overall divided 
model. 



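Proposition. 

$$1 - e^{-a\,\mathit{IN}/150} \;<\; P_b \;<\; 1 - e^{-\left(\frac{a\,\mathit{IN}}{150}\,2^{2/B} \;+\; \frac{a^{2}\,2^{4/B}}{N - a\,2^{2/B}}\right)} \tag{39}$$

where $P_b$ is the probability that at a randomly selected point in time one or more buffers have at least 
$(\log_2 N)B$ packets, and a is such that 

$$\alpha = 2^{\left(\frac{-1}{100B} + \frac{\log_2 a}{100(\log_2 N)B}\right)}. \tag{40}$$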

Proof. We use (37) and (38) to show the desired relations (39). We express $P_b$, the probability that at a 
randomly selected point in time one or more buffers have at least $(\log_2 N)B$ packets, in terms of $P_{gb}$, the 
probability that at a randomly selected point in time a given buffer has at least $(\log_2 N)B$ packets. We 
bound $P_{gb}$ in terms of a stationary distribution for the Markov chain $MC_D$. We then use (37) and (38) 
to get the desired bounds, (39), on $P_b$. 

Since each connected source and sink of the model are independent of the other sources and sinks 
of the model and since there are N/2 sinks, 

$$P_b = 1 - \left(1 - P_{gb}\right)^{N/2}. \tag{41}$$

$P_{gb}$, the probability that at a randomly selected point in time a given buffer has at least $(\log_2 N)B$ 
packets, can be expressed in terms of $P_{DS}$ where $P_{DS}$ is some stationary distribution for $MC_D$ such 
that the probability that a buffer is in state i at a randomly selected point in time is equal to $P_{DS}(i)$. 
The choice of the particular stationary distribution may depend on the initial state of the buffer. $P_{gb}$ is 
equal to $\sum_{i \ge 100(\log_2 N)B} P_{DS}(i)$ which is equal to 

$$1 - \Big(\sum_{i=\mathit{IN}}^{99} P_{DS}(i)\Big) - \Big(\sum_{i=100}^{100(\log_2 N)B-1} P_{DS}(i)\Big). \tag{42}$$

$P_{gb}$ is also equal to 

$$\Big(\sum_{i=100}^{\infty} P_{DS}(i)\Big) - \Big(\sum_{i=100}^{100(\log_2 N)B-1} P_{DS}(i)\Big). \tag{43}$$

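Using (37), (38), (42), and (43) we can bound $P_{gb}$, the probability that at a randomly selected point 
in time a given buffer has at least $(\log_2 N)B$ packets. Since the long term average rate of packets out of a 
buffer of the divided model must equal the rate in, 

$$.75\Big(\sum_{i \ge 100} P_{DS}(i)\Big) = \mathit{IN}/100. \tag{44}$$

Thus, 

$$\Big(\sum_{i=\mathit{IN}}^{99} P_{DS}(i)\Big) = 1 - \mathit{IN}/75 \tag{45}$$

and 

$$P_{gb} = \mathit{IN}/75 - \Big(\sum_{i=100}^{100(\log_2 N)B-1} P_{DS}(i)\Big). \tag{46}$$

Given (38), 

$$P_{gb} \le (\mathit{IN}/75)\,\alpha^{\left(100(\log_2 N)B - \mathit{IN} - 99\right)}. \tag{47}$$

(37), (43), and (44) imply that 

$$P_{gb} \ge (\mathit{IN}/75) - (\mathit{IN}/75)\left(1 - \alpha^{\,100(\log_2 N)B - 101 + \mathit{IN}}\right). \tag{48}$$

Thus, since $\alpha < 1$ and IN < 101 we can conclude from (47) and (48) that 

$$(\mathit{IN}/75)\,\alpha^{\left(100(\log_2 N)B - \mathit{IN} - 99\right)} \;\ge\; P_{gb} \;>\; (\mathit{IN}/75)\,\alpha^{\,100(\log_2 N)B}. \tag{49}$$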

We introduce some notation that makes the discussion below simpler than it would otherwise be. 
In particular, we define a to be such that 

$$\alpha = 2^{\left(\frac{-1}{100B} + \frac{\log_2 a}{100(\log_2 N)B}\right)}. \tag{50}$$

It should be noted that a is a function of IN, the source input rate, just as $\alpha$ is a function of IN. 
Given (49) and (50), $P_{gb}$, the probability that at a randomly selected point in time a given buffer 
has at least $(\log_2 N)B$ packets, is less than 

$$(\mathit{IN}/75)\,2^{\left(-(\log_2 N) + \frac{\mathit{IN}+99}{100B} + (\log_2 a) - \frac{(\mathit{IN}+99)(\log_2 a)}{100(\log_2 N)B}\right)} \;<\; (\mathit{IN}/75)(a/N)\,2^{2/B} \tag{51}$$

and 

$$P_{gb} > (\mathit{IN}/75)(a/N). \tag{52}$$

From (51) and (52), we deduce bounds on $P_b$, the probability that at a randomly selected point in 
time one or more buffers have at least $(\log_2 N)B$ packets. Since $0 < P_{gb} < 1$, 
$(1 - P_{gb})^{N/2} < e^{-P_{gb}(N/2)}$. Thus, with a defined as above $P_b$ is greater than 

$$1 - e^{-a\,\mathit{IN}/150}. \tag{53}$$

Since $0 < P_{gb} < 1$, 

$$\left(1 - P_{gb}\right)^{N/2} = e^{-\left(P_{gb} + \sum_{k=2}^{\infty}\frac{1}{k}P_{gb}^{\,k}\right)(N/2)} \;>\; e^{-\left(P_{gb} + \frac{P_{gb}^{2}}{1 - P_{gb}}\right)(N/2)}. \tag{54}$$

Thus, with a defined as above $P_b$, the probability that at a randomly selected point in time one or more 
buffers have at least $(\log_2 N)B$ packets, is less than 

$$1 - e^{-\left(\frac{a\,\mathit{IN}}{150}\,2^{2/B} \;+\; \frac{(\mathit{IN}/75)^{2}(a/N)^{2}(N/2)\,2^{4/B}}{1 - (\mathit{IN}/75)(a/N)\,2^{2/B}}\right)} \tag{55}$$

and 

$$P_b < 1 - e^{-\left(\frac{a\,\mathit{IN}}{150}\,2^{2/B} \;+\; \frac{a^{2}\,2^{4/B}}{N - a\,2^{2/B}}\right)}. \tag{56}$$

This ends the discussion of the proposition. 
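To get a feel for the magnitudes involved, relations (41) and (49) can be evaluated numerically. The sketch below is illustrative only: the function names `alpha_root`, `any_buffer` and `buffer_bounds` are hypothetical, the root of $\alpha^{IN} = .25 + .75\alpha^{100}$ is found by plain bisection, and the bracketing interval relies on the assumption 0 < IN < 75.

```python
import math

def alpha_root(IN):
    """Root in (0, 1) of alpha**IN = .25 + .75 * alpha**100 (assumes 0 < IN < 75)."""
    f = lambda x: x ** IN - 0.25 - 0.75 * x ** 100
    # f(0) < 0, and f is positive at its single interior maximum, used here as hi.
    lo, hi = 0.0, (IN / 75.0) ** (1.0 / (100 - IN))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def any_buffer(p, sinks):
    """Relation (41): 1 - (1 - p)**sinks, computed stably for very small p."""
    return 1.0 if p >= 1.0 else -math.expm1(sinks * math.log1p(-p))

def buffer_bounds(N, B, IN):
    """Bounds on P_gb from (49) and the resulting bounds on P_b from (41)."""
    alpha = alpha_root(IN)
    full = 100 * math.log2(N) * B                 # 100 (log2 N) B subpackets
    gb_lo = (IN / 75.0) * alpha ** full
    gb_hi = min(1.0, (IN / 75.0) * alpha ** (full - IN - 99))
    sinks = N / 2
    return (gb_lo, gb_hi), (any_buffer(gb_lo, sinks), any_buffer(gb_hi, sinks))

if __name__ == "__main__":
    print(buffer_bounds(1024, 5, 70))
```

The example call uses N = 1024, B = 5 and IN = 70 purely as sample values; they are not taken from the analysis above.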
Implications of the Models on Network Throughput 

The lower bound, (53), on $P_b$ for the divided model suggests a limit on the maximum input rate of 
the original model. In the divided model, $P_b$ is the probability that at a randomly selected point in time 
at least one of the buffers has at least $(\log_2 N)B$ packets. As was discussed earlier, it seems reasonable to 
assume that input rates that imply a high $P_b$ in the divided model can not be supported in the original 
model. However, this does not place a strong constraint on the input rate of the original model. From 
relation (53), we know that if we want $P_b$ in the divided model to be less than some value then we must 
choose IN such that $1 - e^{-a\,\mathit{IN}/150}$ is less than that value. As N goes to infinity, if a goes to zero and 
IN > 1 then $1 - e^{-a\,\mathit{IN}/150}$ goes to zero. As N goes to infinity, if a goes to infinity and IN > 1 then 
$1 - e^{-a\,\mathit{IN}/150}$ goes to one. Thus to find an upper bound on IN in the divided model for a particular 
$P_b$ in the limit as N goes to infinity, we assume that a approaches a constant independent of N. In such 
a case, equation (50) implies that $\alpha$ approaches $2^{-1/(100B)}$. The corresponding value of IN can be 
deduced from the equation, $\alpha^{\mathit{IN}} = .25 + .75\,\alpha^{100}$. For example, if B is equal to five then $\alpha$ must be 
less than $2^{-1/500}$. $2^{-1/500}$ is approximately equal to .998615. This implies an upper bound on IN of 73. 
For comparison, the upper bound on IN implied for B equal to one is 67, the bound for B equal to two is 
71, and the bound for B equal to ten is 74. Obviously, these upper bounds are close to the upper bound 
of 75 placed by the fact that, as was discussed in the description of the original model, a probabilistic 
sink of this section can only accept packets at an average rate of .75 packets per unit time. 
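The upper bounds quoted above can be reproduced with a short calculation. Setting $\alpha = 2^{-1/(100B)}$ in $\alpha^{IN} = .25 + .75\alpha^{100}$ and solving for IN gives $IN = \log(.25 + .75\alpha^{100}) / \log\alpha$; the bound is the largest whole number below that value. The helper name `in_upper_bound` is hypothetical and the snippet is only a check of the arithmetic.

```python
import math

def in_upper_bound(B):
    """Real-valued IN solving alpha**IN = .25 + .75*alpha**100 at alpha = 2**(-1/(100*B))."""
    alpha = 2.0 ** (-1.0 / (100 * B))
    return math.log(0.25 + 0.75 * alpha ** 100) / math.log(alpha)

if __name__ == "__main__":
    for B in (1, 2, 5, 10):
        # Whole-number bounds: 67, 71, 73 and 74 respectively.
        print(B, math.floor(in_upper_bound(B)))
```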

In fact, it seems that if the buffers of the original model have at least modest size then the average 
input rate of the model can be close to the average rate at which packets can be removed from the buffer 
of a probabilistic sink. We have simulated the original model for several buffer sizes. The model was 
simulated with both 512 probabilistic sinks and with 1024 probabilistic sinks. A simulation run was made 
for each combination of model width and buffer size. In each run, the deterministic source was capable 
of generating a packet in each time unit that the source was not blocked. Thus, the rate at which packets 
were generated by the source was determined by blocking. The buffers in the model were full at the 
beginning of each run. For each run, we measured the number of packets that were generated by the 
deterministic source during the run. The results are shown in Table I. For each combination of model 
width and buffer size, the average rate at which packets were generated is listed. As can be seen from the 
table, the results for 512 sinks are close to the results for 1024 sinks. For both sets of results, the average 
input rates are not far from .75 packets per unit time for even modest buffer sizes. 
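A simulation of this kind can be sketched as follows. The code is a rough illustration rather than a reconstruction of the simulator used for Table I: the heap-style array layout, the function name `simulate_tree`, and the order of events inside a time unit (sinks remove, nodes propagate, source generates, nodes propagate again) are all assumptions made here, so its output should not be expected to match the table exactly.

```python
import random

def simulate_tree(num_sinks, B, steps, seed=0):
    """Average number of packets generated per time unit by the source.

    Assumed layout: tree nodes are 1..num_sinks-1 with children 2k and 2k+1,
    sinks are num_sinks..2*num_sinks-1; num_sinks must be a power of two.
    """
    rng = random.Random(seed)
    count = [B] * (2 * num_sinks)        # every buffer starts full (index 0 unused)

    def propagate():
        # Nodes operate instantly: keep copying packets down until nothing can move.
        moved = True
        while moved:
            moved = False
            for k in range(num_sinks - 1, 0, -1):
                while count[k] > 0 and count[2 * k] < B and count[2 * k + 1] < B:
                    count[k] -= 1
                    count[2 * k] += 1
                    count[2 * k + 1] += 1
                    moved = True

    generated = 0
    for _ in range(steps):
        for s in range(num_sinks, 2 * num_sinks):
            if count[s] > 0 and rng.random() < 0.75:   # probabilistic sink
                count[s] -= 1
        propagate()
        if count[1] < B:                 # source is blocked only if the root buffer is full
            count[1] += 1
            generated += 1
        propagate()
    return generated / steps

if __name__ == "__main__":
    for B in (1, 2, 5):
        print(B, simulate_tree(64, B, 5000))
```

The example run uses 64 sinks to keep the running time short; larger widths such as 512 or 1024 only change the `num_sinks` argument.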

Thus, our study of the original model suggests that slow routers of the last stage in an indirect 
n-cube network do not place a severe constraint on the throughput of the network. 

The Stationary Distribution for the Divided Model 

We now return to the detailed analysis of the divided model in order to show the bounds, (37) and 
(38), on any stationary distribution for the divided model. 

Since direct analysis of $MC_D$, the Markov chain for the divided model, seems difficult, we 
indirectly analyze it by comparing it to two simpler chains. The first of these chains eliminates the first 
100 states of the original chain since we are primarily interested in the later states of the original chain. 
The second chain is easier to analyze than the first chain and gives information about the first chain and 
the original chain. We introduce each of these chains and show the relation between the stationary 
distributions of each and the stationary distributions of $MC_D$, the Markov chain for the divided model. 
We obtain the stationary distribution for the second chain and use it to bound the stationary distribution 
for $MC_D$. 

Table I. Simulation of the Model of the Effect of the Last Stage 

B                                       1      2      3      5      10 

av. in for 512 Probabilistic Sinks      50.8   64.0   67.6   70.6   72.6 
av. in for 1024 Probabilistic Sinks     50.7   63.5   67.6   70.6   72.8 

The Markov chain $MC_{D1}$ shown in Figure 15 is very similar to the chain for the divided model 
with the exception that the states less than 100 have been removed. For $100 \le i \le 199 - \mathit{IN}$, wrap(i) is 
equal to the smallest j such that for some integer k, $k\,\mathit{IN} + i - 100 + \mathit{IN} = j$ and $j \ge 100$. wrap(i) 
is the first state greater than or equal to 100 that would be reached after a transition down from state i. 
We use the notation $p_{D1}(i,j)$ to refer to the probability of a transition to state j given that the chain is in 
state i. A stationary distribution, $P_{D1S}$, for the chain is a distribution such that $P_{D1S}(i)$, the 
probability that the chain is in state i, is given by the equation 

$$P_{D1S}(i) = \sum_{j \ge 100} p_{D1}(j,i)\,P_{D1S}(j). \tag{57}$$

We refer to (57) as the equilibrium equation for $P_{D1S}(i)$. The equilibrium equations of $MC_{D1}$ are 
very similar to the equilibrium equations of the divided model. 

Fig. 15. $MC_{D1}$ 
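The definition of wrap(i) amounts to following the chain of the divided model from the state reached by a downward transition until it first returns to a state of at least 100. A minimal sketch, with the argument check and the function signature chosen here for illustration:

```python
def wrap(i, IN):
    """First state >= 100 reached after a transition down from state i.

    Defined for 100 <= i <= 199 - IN, where the downward transition lands below 100.
    """
    if IN <= 0 or not 100 <= i <= 199 - IN:
        raise ValueError("wrap(i) is defined for IN > 0 and 100 <= i <= 199 - IN")
    j = i - 100 + IN        # state right after the sink removes a packet
    while j < 100:          # below 100 the only transition adds IN subpackets
        j += IN
    return j

if __name__ == "__main__":
    print(wrap(100, 30))    # 30 -> 60 -> 90 -> 120
    print(wrap(169, 30))    # 99 -> 129
```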

This similarity can be seen more clearly if $P_{DS}(\mathit{IN})$ through $P_{DS}(99)$, the stationary probabilities 
for states less than 100 in the divided model, are eliminated from the equilibrium equations of the divided 
model. We can eliminate $P_{DS}(\mathit{IN})$ by using the equation for $P_{DS}(\mathit{IN})$ to substitute for $P_{DS}(\mathit{IN})$ in 
the remaining equations. This process can be continued for $P_{DS}(\mathit{IN}+1)$ through $P_{DS}(99)$. The 
resulting equation for $P_{DS}(i)$ for each $i \ge 100$ corresponds directly to the equilibrium equation for 
$P_{D1S}(i)$. In particular, if we take the equation for $P_{DS}(i)$ and map $P_{DS}(j)$ to $P_{D1S}(j)$ for each 
$j \ge 100$, we obtain an equation which is identical to the equilibrium equation for $P_{D1S}(i)$. From this we 
can conclude that for any stationary distribution $P_{DS}$ for the Markov chain $MC_D$, the chain for the 
divided model, there exists a stationary distribution $P_{D1S}$ for the Markov chain $MC_{D1}$ such that for 
some constant c and for all $i \ge 100$, $P_{D1S}(i) = c\,P_{DS}(i)$. Since $\sum_{i \ge 100} P_{D1S}(i) = 1$, we can 
conclude that $c = 1 / \sum_{i \ge 100} P_{DS}(i)$ and that 

$$P_{D1S}(i) = \left(\frac{1}{\sum_{j \ge 100} P_{DS}(j)}\right) P_{DS}(i). \tag{58}$$

The chain $MC_{D2}$ shown in Figure 16 is related to $MC_{D1}$ and is therefore also related to $MC_D$, 
the chain for the divided model. We examine $MC_{D2}$ because it is related to $MC_D$ and because, as we 
shall see later, the stationary distribution for $MC_{D2}$ is easy to determine since there is a simple geometric 
relationship among the probabilities of the various states. We use the notation $p_{D2}(i,j)$ to refer to the 
probability of a transition to state j given that the chain is in state i. If $\alpha$ is such that 
$\alpha^{\mathit{IN}} = .25 + .75\,\alpha^{100}$ then 

$$p_{D2}(i,j) = .75\,\alpha^{(j-100)}\left(\frac{1-\alpha}{1-\alpha^{\mathit{IN}}}\right) \quad \text{for } 100 \le i \le 199-\mathit{IN} \text{ and } 100 \le j \le 99+\mathit{IN},$$

$$p_{D2}(i,\,i+\mathit{IN}) = .25 \quad \text{for } i \ge 100,$$

$$p_{D2}(i,\,i-100+\mathit{IN}) = .75 \quad \text{for } i \ge 200-\mathit{IN},$$

and 

$$p_{D2}(i,j) = 0 \quad \text{otherwise.} \tag{59}$$

Fig. 16. $MC_{D2}$ 

The stationary distribution $P_{D2S}$ for the Markov chain $MC_{D2}$ is the distribution such that $P_{D2S}(i)$, 
the probability that the chain is in state i, is given by the equation 

$$P_{D2S}(i) = \sum_{j \ge 100} p_{D2}(j,i)\,P_{D2S}(j). \tag{60}$$



The transitions of the chain $MC_{D2}$ are similar to those of the chain $MC_{D1}$. Many of the possible 
transitions of $MC_{D2}$ have the same probabilities as the corresponding transitions of $MC_{D1}$. In 
particular, 

$$p_{D2}(i,\,i+\mathit{IN}) = p_{D1}(i,\,i+\mathit{IN}) = .25 \quad \text{for } i \ge 100$$

and 

$$p_{D2}(i,\,i-100+\mathit{IN}) = p_{D1}(i,\,i-100+\mathit{IN}) = .75 \quad \text{for } i \ge 200-\mathit{IN}. \tag{61}$$

The transitions of $MC_{D2}$ from i to j where $100 \le i \le 199-\mathit{IN}$ and $100 \le j \le 99+\mathit{IN}$ differ from the 
corresponding transitions of $MC_{D1}$. But, for any i such that $100 \le i \le 199-\mathit{IN}$, 

$$\sum_{j=100}^{99+\mathit{IN}} p_{D2}(i,j) = \sum_{j=100}^{99+\mathit{IN}} p_{D1}(i,j) = (.75). \tag{62}$$

We show that any stationary distribution $P_{D1S}$ for the chain $MC_{D1}$ is related to the stationary 
distribution $P_{D2S}$ for the chain $MC_{D2}$. 

Proposition. 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D2S}(i) \;\ge\; \sum_{i=100}^{j} P_{D1S}(i) \tag{63}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D1S}(i) \;\ge\; \sum_{i=100}^{j} P_{D2S}(i). \tag{64}$$

Proof. This can be seen by comparing the operation of the two chains during a long period of time. For 
the purpose of discussion, we define $P_{D1}(t,i)$ to be the probability that chain $MC_{D1}$ is in state i at 
time t. We define $P_{D2}(t,i)$ to be the probability that chain $MC_{D2}$ is in state i at time t. We assume 
that for all $i \ge 100$, $P_{D1}(1,i)$ is equal to $P_{D2}(1,i)$ and is also equal to $P_{D1S}(i)$ where $P_{D1S}$ is the 
given stationary distribution for the chain $MC_{D1}$. 

We show a relationship between $P_{D1}$ and $P_{D2}$ that exists for all t. Using this, we show a similar 
relationship between $P_{D1S}$, the given stationary distribution for the chain $MC_{D1}$, and $P_{D2S}$, the 
stationary distribution for the chain $MC_{D2}$. 

In particular, we show by induction on t that for all t > 0 and $j \ge 100$ 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D2}(t,i) \;\ge\; \sum_{i=100}^{j} P_{D1}(t,i) \tag{65}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D1}(t,i) \;\ge\; \sum_{i=100}^{j} P_{D2}(t,i). \tag{66}$$

Clearly, these relations hold for t = 1. We show the case for t > 1 by considering the transitions of each 
chain. From the transitions of the chain $MC_{D1}$, it can be shown for t > 1 and $j \ge 100$ that 

$$\sum_{i=100}^{j} P_{D1}(t,i) \;\le\; .75\sum_{i=100}^{199-\mathit{IN}} P_{D1}(t-1,i) \;+\; .25\sum_{i=100}^{j-\mathit{IN}} P_{D1}(t-1,i) \;+\; .75\sum_{i=200-\mathit{IN}}^{j+100-\mathit{IN}} P_{D1}(t-1,i) \tag{67}$$

or 

$$\sum_{i=100}^{j} P_{D1}(t,i) \;\le\; .75\sum_{i=100}^{j+100-\mathit{IN}} P_{D1}(t-1,i) \;+\; .25\sum_{i=100}^{j-\mathit{IN}} P_{D1}(t-1,i) \tag{68}$$

and that 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D1}(t,i) \;=\; .75\sum_{i=100}^{199-\mathit{IN}} P_{D1}(t-1,i) \;+\; .25\sum_{i=100}^{j-1} P_{D1}(t-1,i) \;+\; .75\sum_{i=200-\mathit{IN}}^{j+99} P_{D1}(t-1,i) \tag{69}$$

or 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D1}(t,i) \;=\; .75\sum_{i=100}^{j+99} P_{D1}(t-1,i) \;+\; .25\sum_{i=100}^{j-1} P_{D1}(t-1,i). \tag{70}$$

Similarly, from the transitions of the chain $MC_{D2}$ it can be shown for t > 1 and $j \ge 100$ that 

$$\sum_{i=100}^{j} P_{D2}(t,i) \;\le\; .75\sum_{i=100}^{j+100-\mathit{IN}} P_{D2}(t-1,i) \;+\; .25\sum_{i=100}^{j-\mathit{IN}} P_{D2}(t-1,i) \tag{71}$$

and that 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D2}(t,i) \;=\; .75\sum_{i=100}^{j+99} P_{D2}(t-1,i) \;+\; .25\sum_{i=100}^{j-1} P_{D2}(t-1,i). \tag{72}$$



Thus, given the hypothesis of induction we can conclude that the desired relations, (65) and (66), hold for 
t > 1. As a result, we can conclude that the desired relations hold for all t > 0. In other words, for t > 0 and 
$j \ge 100$ 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D2}(t,i) \;\ge\; \sum_{i=100}^{j} P_{D1}(t,i) \tag{73}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D1}(t,i) \;\ge\; \sum_{i=100}^{j} P_{D2}(t,i). \tag{74}$$

Since $P_{D1}(t,i)$ equals $P_{D1S}(i)$ for t > 0 where $P_{D1S}$ is the stationary distribution chosen 
above for the chain $MC_{D1}$ and since in the limit as t goes to infinity $P_{D2}(t,i)$ goes to $P_{D2S}(i)$ where 
$P_{D2S}$ is the stationary distribution for the chain $MC_{D2}$, we can conclude that 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D2S}(i) \;\ge\; \sum_{i=100}^{j} P_{D1S}(i) \tag{75}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{D1S}(i) \;\ge\; \sum_{i=100}^{j} P_{D2S}(i). \tag{76}$$

This ends the discussion of the proposition. 

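We can relate any stationary distribution $P_{DS}$ for $MC_D$, the chain for the divided model, to the 
stationary distribution $P_{D2S}$ for the chain $MC_{D2}$. 

Proposition. 

$$\sum_{i=100}^{j+\mathit{IN}-1} (\mathit{IN}/75)\,P_{D2S}(i) \;\ge\; \sum_{i=100}^{j} P_{DS}(i) \quad \text{for } j \ge 100 \tag{77}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{DS}(i) \;\ge\; \sum_{i=100}^{j} (\mathit{IN}/75)\,P_{D2S}(i) \quad \text{for } j \ge 100. \tag{78}$$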

Proof. From (58), we know that for any stationary distribution $P_{DS}$ for $MC_D$ there exists a stationary 
distribution $P_{D1S}$ for the chain $MC_{D1}$ such that for $i \ge 100$ 

$$P_{D1S}(i) = \left(\frac{1}{\sum_{j \ge 100} P_{DS}(j)}\right) P_{DS}(i). \tag{79}$$

Since the long term average rate of packets out of a buffer of the divided model must equal the rate in, 
$.75\left(\sum_{i \ge 100} P_{DS}(i)\right) = \mathit{IN}/100$. Thus, 

$$P_{DS}(i) = (\mathit{IN}/75)\,P_{D1S}(i) \quad \text{for } i \ge 100. \tag{80}$$

Given the relationship between any stationary distribution for $MC_{D1}$ and the stationary distribution for 
$MC_{D2}$, we can conclude that 

$$\sum_{i=100}^{j+\mathit{IN}-1} (\mathit{IN}/75)\,P_{D2S}(i) \;\ge\; \sum_{i=100}^{j} P_{DS}(i) \quad \text{for } j \ge 100 \tag{81}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{DS}(i) \;\ge\; \sum_{i=100}^{j} (\mathit{IN}/75)\,P_{D2S}(i) \quad \text{for } j \ge 100. \tag{82}$$

This ends the discussion of the proposition. 



The stationary distribution for the chain $MC_{D2}$ can be easily obtained. The equilibrium equations 
for $MC_{D2}$ are 

$$P_{D2S}(i) = .75\,P_{D2S}(i+100-\mathit{IN}) + .75\,\alpha^{(i-100)}\left(\frac{1-\alpha}{1-\alpha^{\mathit{IN}}}\right)\sum_{j=100}^{199-\mathit{IN}} P_{D2S}(j) \quad \text{for } 100 \le i \le 99+\mathit{IN}$$

and 

$$P_{D2S}(i) = .75\,P_{D2S}(i+100-\mathit{IN}) + .25\,P_{D2S}(i-\mathit{IN}) \quad \text{for } i \ge 100+\mathit{IN} \tag{83}$$

where $0 < \alpha < 1$ and $\alpha$ is the root in this range of the equation 

$$\alpha^{\mathit{IN}} = .25 + .75\,\alpha^{100}.$$

From the equilibrium equations, it can be shown that 

$$P_{D2S}(i) = \alpha^{\,i-100}\,P_{D2S}(100) \quad \text{for } i \ge 100. \tag{84}$$

Since $\sum_{i \ge 100} P_{D2S}(i) = 1$, we can conclude that 

$$P_{D2S}(100) = (1-\alpha). \tag{85}$$

Thus, 

$$P_{D2S}(i) = \alpha^{\,i-100}(1-\alpha) \quad \text{for } i \ge 100. \tag{86}$$
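The geometric form (86) can be checked numerically against the equilibrium equations (83). The sketch below makes two assumptions that should be kept in mind: the weight on the wrapped transitions is taken to be $.75\,\alpha^{(i-100)}(1-\alpha)/(1-\alpha^{IN})$, which is the normalization that makes the sum in (62) equal .75, and the root $\alpha$ is found by bisection for 0 < IN < 75. The function names are hypothetical.

```python
def alpha_root(IN):
    """Root in (0, 1) of alpha**IN = .25 + .75 * alpha**100 (assumes 0 < IN < 75)."""
    f = lambda x: x ** IN - 0.25 - 0.75 * x ** 100
    lo, hi = 0.0, (IN / 75.0) ** (1.0 / (100 - IN))   # f(lo) < 0 < f(hi)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equilibrium_residual(IN, max_state=600):
    """Largest violation of (83) by P(i) = alpha**(i-100) * (1 - alpha), 100 <= i < max_state."""
    a = alpha_root(IN)
    P = lambda i: a ** (i - 100) * (1 - a)                 # candidate solution (86)
    low_mass = sum(P(j) for j in range(100, 200 - IN))     # states 100 .. 199-IN
    worst = 0.0
    for i in range(100, max_state):
        if i <= 99 + IN:
            rhs = (0.75 * P(i + 100 - IN)
                   + 0.75 * a ** (i - 100) * (1 - a) / (1 - a ** IN) * low_mass)
        else:
            rhs = 0.75 * P(i + 100 - IN) + 0.25 * P(i - IN)
        worst = max(worst, abs(P(i) - rhs))
    return worst

if __name__ == "__main__":
    print(equilibrium_residual(60))   # should be at rounding-error level
```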
From (86), the solution for the stationary distribution $P_{D2S}$ for the chain $MC_{D2}$, and (81) and 
(82), the relations between $P_{D2S}$ and any stationary distribution $P_{DS}$ for the chain of the divided 
model, we can conclude that 

$$\left(1-\alpha^{\,j-100+\mathit{IN}}\right)(\mathit{IN}/75) \;\ge\; \sum_{i=100}^{j} P_{DS}(i) \quad \text{for } j \ge 100 \tag{87}$$

and 

$$\sum_{i=100}^{j+\mathit{IN}-1} P_{DS}(i) \;\ge\; \left(1-\alpha^{\,j-99}\right)(\mathit{IN}/75) \quad \text{for } j \ge 100 \tag{88}$$

where $0 < \alpha < 1$ and 

$$\alpha^{\mathit{IN}} = .25 + .75\,\alpha^{100}.$$

2.2.3.4 Interaction of Stages 

In the following paragraphs, we examine the interaction of routers of different stages. In the 
previous paragraphs, we examined the effect of conflict at one router when conflict at all other routers 
was ignored, and we examined the effect of conflict in a given stage of routers when conflict in all other 
stages was ignored. Now we consider the effect of the interaction of routers in various stages of the 
network. 

The discussion has two parts. 

In the first part, we consider all the routers along a given path through the network. The diffusion 
of packet flow that occurs along such a path due to the interaction of routers of the path seems to be one 
of the primary factors constraining the overall network throughput. We examine the interaction of 
routers along a path by ignoring conflict at routers that are not on the path. As we shall see, the 
interaction of routers along an infinitely long path still allows a nonzero flow into the path. Since this 
type of interaction between stages represents the strongest constraint on overall network throughput that 
we find, our study of such interaction suggests that the normalized throughput of the indirect n-cube 
network approaches a nonzero asymptote as the size of the network goes to infinity. 

In the second part of the discussion, we consider the interaction of routers in a tree of the network. 
The interaction among such routers can be quite complex. We examine one particular type of interaction. 
While this interaction does not seem to have an important effect on the overall throughput of the 
network, it may cause a few of the routers connected to the network inputs to be slow for a long period of 
time. 

Interaction of Routers Along a Network Path 

We study the flow of packets along a typical path of the indirect n-cube network using the model 
shown in Figure 17. The model represents a path through the network. The model allows us to study the 
effect of conflict at routers along the path in the absence of conflict at other routers. We first describe the 
model, then we analyze it using simulation. 

Model of a Network Path 

The model reflects the interaction of the routers along a path of the indirect n-cube network. The 
model ignores the interaction in the network between a router on the path and any router not on the path. 
The model contains a sequence of 2-input 2-output nodes. The nodes of the model represent the routers 
along the network path. As shown in Figure 17, the first output of each node of the model, except the last 
node, is connected to the first input of the next node. The second output of each node is connected to a 
perfect sink. The second input of each node is connected to a probabilistic source. The first input of the 
first node is connected to a probabilistic source. The first output of the last node is connected to a perfect 
sink. The connections of the second input and second output of each node represent the connections of 
the corresponding router of the network to routers not on the path. The probabilistic source connected to 
the second input of the node provides a steady flow of packets into the node. The rate of flow can be 
adjusted to equal the average flow into the corresponding input of the corresponding router. The perfect 
sink connected to the second output can not block. Thus, congestion in the model can be caused only by 
conflict in the nodes and represents the congestion along the network path due to conflict in the routers 
of the path. 

Fig. 17. Model of a Network Path 

Each node of the model has a buffer on each of its inputs. The buffer on the first input has size B. 
The size of the buffer on the second input is several times B. The long buffer of the second input ensures 
that short term variations of the node do not affect the probabilistic source connected to the input. 

The operation of each node of the model is similar to the operation of a router in the indirect 
n-cube network. Each packet entering the node is assigned a one bit tag which determines its route 
through the node. The tag is randomly selected with zero and one being equally likely. If the tag is zero, 
the packet must leave on the first output of the node. If the tag is one, the packet must leave on the 
second output of the node. In each unit of time, the node attempts to transfer a packet from each of its 
input buffers. For each input buffer, the node attempts to transfer the first packet, the packet that 
entered the buffer first, to the output that corresponds to its tag. The packet is transferred if the buffer 
connected to the desired output is not full, and if there is no conflict from the other input buffer or if the 
packet wins the arbitration of the conflict. In the case of conflict, the node randomly selects a packet to 
transfer. The two possible choices are equally likely. The node will transfer at most one packet from each 
of the input buffers in a unit of time. 

The probabilistic sources and perfect sinks used in this model are similar to the corresponding
devices used in the previous paragraphs. The probabilistic sources produce packets. If the input buffer
connected to a probabilistic source is not full at the beginning of a time unit then with some probability
the probabilistic source places an additional packet in the buffer. The probabilistic source connected to
the first input of the first node generates packets with probability IN. The probabilistic sources
connected to the second inputs of the nodes generate packets with probability SI. The perfect sinks
never block and accept packets at whatever rate they are presented.
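
The model described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulator used for the results reported here. The function name `simulate_path`, the `side_factor` parameter (how many times B the second-input buffers hold), the choice to process nodes from last to first within a time unit, and the assignment of a fresh tag when a packet enters the next node are all assumptions made for the example.

```python
import random
from collections import deque


def simulate_path(length, B, IN, SI, steps, seed=0, side_factor=4):
    """Return the rate at which packets enter the first input of the first node."""
    rng = random.Random(seed)
    first = [deque() for _ in range(length)]   # first-input buffers, capacity B
    second = [deque() for _ in range(length)]  # second-input buffers (assumed capacity)
    side_cap = side_factor * B
    accepted = 0

    for _ in range(steps):
        # Probabilistic sources act on buffers that are not full at the
        # beginning of the time unit. Each new packet gets a random one-bit tag.
        if len(first[0]) < B and rng.random() < IN:
            first[0].append(rng.randrange(2))
            accepted += 1
        for buf in second:
            if len(buf) < side_cap and rng.random() < SI:
                buf.append(rng.randrange(2))

        # Assumption: nodes are processed from last to first, so a packet
        # advances at most one node per time unit.
        for i in range(length - 1, -1, -1):
            # Snapshot the head packets so each buffer sends at most one packet.
            requests = [(buf, buf[0]) for buf in (first[i], second[i]) if buf]
            for out in (0, 1):
                wanting = [buf for buf, tag in requests if tag == out]
                if not wanting:
                    continue
                # Output 0 feeds the next node (or a perfect sink after the
                # last node); output 1 always feeds a perfect sink.
                to_next = out == 0 and i + 1 < length
                if to_next and len(first[i + 1]) >= B:
                    continue  # downstream buffer is full, so the output is blocked
                winner = rng.choice(wanting)  # fair arbitration on conflict
                winner.popleft()
                if to_next:
                    first[i + 1].append(rng.randrange(2))
    return accepted / steps
```

With IN set to 1, calling `simulate_path(length, B, 1.0, SI, steps)` for a range of SI values gives the kind of input-rate curve discussed in the next section.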
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Simulation of the Model 

We use simulation of the model to study the effect of conflict in a randomly selected network path
on the rate at which packets can enter the first router of the path. In this section, we describe the
simulation of the model, discuss the implications of the simulation results, and compare the simulation
results for the model to simulation results for the complete indirect n-cube network.

We run the simulation in such a way that we can use the results of the simulation to draw
conclusions about the limit that conflict in a randomly selected network path places on the rate at which
packets can enter the first router of the path. In the model, we set IN to 1. We examine the rate at which
packets enter the first input of the first node as a function of the value of SI. We find a value such that
when SI is set to the value, the rate at which packets enter the first input of the first node is also equal to
the value. We refer to this value as the maximum input rate for the model. The maximum input rate for
the model in some sense represents the limit that conflict in a randomly selected network path places on
the rate at which packets can enter the first router of the path, if conflict elsewhere in the network is
ignored and if it is assumed that each network input receives the same input rate.

We have simulated the model for several values of B and for several path lengths. The maximum
input rate for each case is shown in Table II. It should be noted that the values listed are percentages and
that only values with whole percentage points were used in the simulation.
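
The search for the maximum input rate can be sketched as a scan over whole percentage points. The helper below is hypothetical: `maximum_input_rate` and its argument `rate_at` are names introduced for this example, and `rate_at` stands for any function that returns the measured input rate of the first node for a given SI (for instance, a wrapper around a simulation with IN set to 1).

```python
def maximum_input_rate(rate_at, percents=range(0, 101)):
    """Return the whole-percent value p for which rate_at(p / 100) is closest to p / 100.

    rate_at(si) is assumed to return the rate at which packets enter the
    first input of the first node when the side sources generate packets
    with probability si.
    """
    return min(percents, key=lambda p: abs(rate_at(p / 100) - p / 100))


# Synthetic stand-in for a simulation: the input rate falls linearly with SI.
print(maximum_input_rate(lambda si: 1.0 - si))
```

Because the measured rate comes from a finite simulation, the value found this way is only as precise as the run length allows.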

The model suggests that the effect of conflict in a randomly selected path increases with the length
of the path. The maximum input rate for the model decreases as the length of the model increases. Each
node can block all earlier nodes. Nodes at the beginning of a long path can be blocked by any of the later
nodes. Thus, the maximum input rate for the model of a long path is less than the maximum input rate
for the model of a short path.

Table II. Simulation of the Model of a Network Path

However, the model suggests that for very long paths the effect of conflict increases very slowly
with path length. The chance in the model of a long path that all of the buffers between a late node and
an early node are full is small. The chance that a conflict in the late node blocks the early node is small.
While we will not discuss this issue in detail, we expect that the maximum input rate for the model
approaches a nonzero asymptote as the length of the model goes to infinity. As is shown in Table II,
128-node models were simulated. We expect that the maximum input rates for the 128-node models are
close to the maximum input rates for infinite length models.

The limit of the input rate of a randomly selected network path implies a limit on the overall
throughput of a network. Clearly, we do not expect the total throughput of a network to be greater than
the width of the network times the maximum input rate of a randomly selected network path. For most
networks, this limit is stronger than any of the other limits studied so far. For example, this limit is
usually stronger than the limit placed by the slowest router in the last stage when conflict in other stages is
ignored. However, it is important to note that this limit does not rule out high throughput for very large
networks. The ratio of this limit to the number of network inputs approaches a nonzero asymptote as the
size of the network goes to infinity.

We have not found any factors that place a significantly stronger constraint on network throughput
than the interaction of routers along network paths discussed in the previous paragraphs. This suggests
that the normalized throughput, throughput divided by the number of network inputs, of the indirect
n-cube network approaches a nonzero asymptote as the size of the network goes to infinity.

For comparison, complete indirect n-cube networks were simulated. The results are shown in Table
III. While the simulation results do not clearly indicate the normalized throughput of networks of infinite
size, the normalized throughputs of the networks simulated are consistent with the results of the model
above since the normalized throughput of each complete network is less but not drastically less than the
normalized throughput of the corresponding model.

Table III. Complete Network Simulation

Interaction of Routers in a Tree

In the following paragraphs, we consider the interaction of routers in a tree of routers in an indirect
n-cube network. The interaction among such routers can be quite complex. We examine one particular
type of interaction and its effect on the behavior of the network. While this interaction does not seem to
have an important effect on the overall throughput of the network, this interaction may cause a few of the
routers connected to the network inputs to be slow for a long period of time.

The effect of this interaction is important in networks of modest size, networks with fewer than 10
stages. As was mentioned above, this interaction may cause a few of the network inputs to be slow for a
long period of time. We develop a model for this interaction, use the model to estimate its effect, and 
check our estimate by simulating the whole network. Our model suggests that if this type of interaction 
occurred in arbitrarily large trees, its effect would asymptotically grow as the square of the depth of the 
network. In fact, this interaction only occurs in trees of modest size, but it is strong enough to cause the 
input rate for the slowest input of a network with eight or nine stages to be less than half the expected 
input rate for a randomly selected input for a period of forty units of time. 

The type of interaction discussed in the previous paragraph is less important in very large networks,
networks whose depths are much greater than 10 stages. Since that type of interaction does not occur in
very large trees, other factors become more important for very large networks. We briefly consider one of
these factors. For very large networks, as we shall see, this factor implies that the slowest input router
requires greater than c_9 n/(log_2 n) time to accept 3B packets where c_9 is a constant, B is the buffer size,
and n is the depth of the network.

Trees in Networks of Modest Size 

In a network of modest size, we examine a particular interaction of routers in a d-stage tree whose
leaves are connected to the network inputs. We select a router of the (n-d)th stage from the final stage of
the network where n is the depth of the network. We refer to this router as the root router of the tree.
We consider the tree composed of the root router and all of the routers that can direct packets to that
router as shown in Figure 18. We refer to the routers of the tree in the (d-1)st stage from the root router as
the leaf routers of the tree. Below, we study the time required by the slowest leaf router to accept c_1 B^2
packets where B is the size of the input buffers of the routers and c_1 is a constant. We do this using a
model of the tree. We introduce the model and analyze it in order to estimate the time required for the
slowest leaf router of the model to accept c_1 B^2 packets. We use simulation to compare our estimate to
the performance of the model and to compare our estimate to the performance of the d-stage tree.

Fig. 18. d-Stage Tree in an n-Stage Network



We assume that conflict in other parts of the network affects the tree only at the root router. It
seems likely that the leaf routers accept packets more quickly with this assumption than without it. Thus,
we make this assumption in order to estimate an upper bound on the rate at which the slowest leaf router
accepts packets.

We assume that the links connected to the outputs of the root router accept packets very slowly.
We assume that the inputs of the network are connected to perfect sources that supply packets as quickly
as they can be accepted. We assume that n, the depth of the network, is no more than 10. We assume
that the rate at which packets can be accepted from the outputs of the root router is limited by conflict in
the later stages of the network and is less than the rate at which the root router can supply packets.

We are interested in the behavior of the tree after the network has been in operation for some
period of time. We randomly choose a point in time after the network has been in operation for a long
time. We examine the duration of the shortest period beyond that point such that during the period each
leaf router of the d-stage tree of the network accepts c_1 B^2 packets where B is the size of the input
buffers of the routers and c_1 is a constant.

Model of a d-Stage Tree

In order to estimate the duration of this period, we study the model of a d-stage tree shown in
Figure 19. The characteristics of the model correspond for the most part to the assumptions that we made
above. The characteristics of the model are such that it seems reasonable to assume that the model will
lead to a lower bound on the duration of the period. We refer to the model as the d-stage special tree or
simply the d-stage special.

The d-stage special is composed of routers, probabilistic sinks, perfect sinks, and perfect sources as
shown in Figure 19. The routers operate in a manner similar to that of the routers of the network. Both
of the outputs of the root router are connected to probabilistic sinks. Each probabilistic sink has an input
buffer of size B. If the input buffer of a probabilistic sink is not empty at the beginning of a time unit
then with probability c_2 the sink removes a packet from the buffer where c_2 is a small constant. Each
router of the tree except the root has one output connected to another router of the tree and one output
connected to a perfect sink. We refer to the routers of the tree in the (d-1)st stage from the root router as
the leaf routers of the tree. The inputs of the leaf routers are connected to perfect sources that produce
packets as quickly as they can be accepted.

Fig. 19. d-Stage Special



Analysis of the Model 



We examine the behavior of the d-stage special after it has been in operation for some period of
time. We randomly choose a point in time after the d-stage special has been in operation for a long
period. We examine the behavior of the d-stage special after that point.

We are interested in the time required after the selected point for the slowest leaf router to accept a
total of c_1 B^2 packets. For the purpose of discussion, we refer to this time as the slowest leaf router
acceptance time, T_d. As before, we use the notation E[x] to refer to the expected value of x. Below, we
estimate a lower bound on E[T_d]. We refer to this estimate as estT_d. We develop a recursive definition
for estT_d. The definition is given in equations 99-101. For large d, as we shall see, estT_d grows as the
square of d. We are primarily interested in estT_d for d < 10 since we expect the type of interaction
modeled by the d-stage special to occur only in trees of depth < 10. As is shown below in Table IV,
estT_d grows rapidly even for d in this range.

In order to make the desired estimate of E[T_d], the expected value of the slowest leaf router
acceptance time, we consider the behavior of one router from each stage. In each stage, we select the
router that accepts the smallest number of packets in the T_d unit period during which the slowest leaf
router accepts c_1 B^2 packets. For 0 <= i < d, we refer to the selected router of the i-th stage from the
root router as the i-th intermediate. For 0 <= i < d, we define numbpackets_d^i to be the number of
packets accepted by the i-th intermediate in the T_d period. In order to estimate E[numbpackets_d^0], the
expected value of numbpackets_d^0, we consider E[numbpackets_d^{i-1}] - E[numbpackets_d^i] for each value of
i such that 0 < i < d and we make use of the fact that E[numbpackets_d^{d-1}] is equal by definition to c_1 B^2.
We use E[numbpackets_d^0], the expected number of packets accepted during the period by the root
router, to estimate E[T_d], the expected length of the period. We argue that E[numbpackets_d^0] is big and
then assuming that it is big we argue that E[T_d] is big.

For each value of i such that 0 < i < d, we estimate E[numbpackets_d^{i-1}] - E[numbpackets_d^i], the
difference between the expected number of packets accepted by the (i-1)st intermediate and the expected
number of packets accepted by the i-th intermediate, by considering the (i-1)st intermediate and the two
routers connected to its inputs as shown in Figure 20. For the purpose of discussion, we refer to the
routers connected to the inputs of the (i-1)st intermediate as the input routers. We consider the operation
of these components in the T_d unit period after the selected point in time.

Fig. 20. (i-1)st Intermediate and Input Routers



The packet tags examined by each of the input routers correspond to a Bernoulli process and the
packet tags examined by one of the input routers are independent of the packet tags examined by the
other input router, and we use these facts in the following paragraphs to estimate a lower bound on
E[numbpackets_d^{i-1}] - E[numbpackets_d^i]. We argue that during the period one input router is likely to
receive a larger fraction of packets labeled for the (i-1)st intermediate than the other input router receives.
We then argue that during the period one of the input routers is likely to accept a smaller total number of
packets than the other input router accepts. Since the number of packets accepted during the period by
the i-th intermediate can be no more than the number accepted by the slower input router of the (i-1)st
intermediate, we estimate a lower bound on E[numbpackets_d^{i-1}] - E[numbpackets_d^i] by estimating the
difference between E[numbpackets_d^{i-1}] and the expected number of packets accepted during the period
by the slower input router of the (i-1)st intermediate.

For the purpose of discussion, we define some notation. We arbitrarily order the input routers.
For k = 1 and k = 2, we refer to the following quantities,

    PA_k, the number of packets accepted during the period by the k-th input router,

    TPA_k, the number of packets accepted during the period by the k-th input router that are tagged
        for the (i-1)st intermediate,

    TPO_k, the number of packets output during the period by the k-th input router that are tagged for
        the (i-1)st intermediate,

    MPA, the minimum of PA_1 and PA_2,

    TPA'_k, the number of packets that are in the first MPA packets accepted by the k-th input router
        and that are tagged for the (i-1)st intermediate,

    f_k, TPA_k / PA_k,

and

    f'_k, TPA'_k / MPA.

We also define kmaxtpa to be equal to one if TPA'_1 is greater than or equal to TPA'_2 and we define
kmaxtpa to be equal to two otherwise.

About half of the packets accepted during the period by an input router are tagged for the (i-1)st
intermediate. The tag of each packet is independent of the tags on other packets. Thus, the tags on the
packets accepted by the router can be considered to correspond to a Bernoulli process. The chance that a
packet accepted by the router is tagged for the (i-1)st intermediate is 0.5. Thus for k = 1 and k = 2,
E[TPA'_k], the expected number of packets that are in the first MPA packets accepted by the k-th input
router and that are tagged for the (i-1)st intermediate, is equal to (1/2)E[MPA].

However, since there are two input routers and since the tags on packets received by the two input
routers correspond to two independent Bernoulli processes,

    E[TPA'_kmaxtpa] > (1/2)E[MPA] + c_3 (E[MPA])^{1/2}                                  (89)

for some constant c_3.

Thus, E[TPA_kmaxtpa], the expected number of packets accepted during the period by the
kmaxtpa-th input router that are tagged for the (i-1)st intermediate, is greater than

    (1/2)E[MPA] + c_3 (E[MPA])^{1/2} + (1/2)E[PA_kmaxtpa - MPA] .                       (90)
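
The square-root term in (89) can be checked numerically for the idealized case in which each input router examines exactly m independent, equally likely tag bits. The sketch below computes the exact expected value of the larger of two independent Binomial(m, 1/2) counts and reports how far it exceeds m/2, in units of m^{1/2}. The function name `expected_max_excess` and the fixed sample size m are assumptions made for this illustration; in the text MPA is itself a random quantity.

```python
from math import comb, sqrt


def expected_max_excess(m):
    """Exact E[max(X1, X2)] - m/2 for independent X1, X2 ~ Binomial(m, 1/2)."""
    pmf = [comb(m, k) / 2 ** m for k in range(m + 1)]
    cdf = []
    running = 0.0
    for p in pmf:
        running += p
        cdf.append(running)
    # P(max <= k) = cdf[k] ** 2, so P(max = k) is the difference of squares.
    expected_max = 0.0
    previous = 0.0
    for k in range(m + 1):
        current = cdf[k] ** 2
        expected_max += k * (current - previous)
        previous = current
    return expected_max - m / 2


for m in (25, 100, 400):
    # The ratio stays roughly constant, which is the behavior (89) relies on.
    print(m, round(expected_max_excess(m) / sqrt(m), 3))
```

In this idealized setting the printed ratio plays the role of the constant c_3.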

Assuming small deviations from the means, we assume that E[f_kmaxtpa], the expected value of
TPA_kmaxtpa / PA_kmaxtpa, is greater than

    1/2 + c_4 (E[MPA])^{1/2} / E[PA_kmaxtpa]                                            (91)

for some constant c_4. We assume that E[PA_kmaxtpa] - E[MPA] << E[PA_kmaxtpa] and that as a result
E[f_kmaxtpa] > 1/2 + c_5 / (E[MPA])^{1/2} for some constant c_5. Thus, we assume that

    E[f_kmaxtpa] > 1/2 + c_6 / (E[PA_kmaxtpa])^{1/2}                                    (92)

for some constant c_6.

To simplify the discussion, we assume that the input buffers of the (i-1)st intermediate remain non
empty during the period. The motivation for this assumption can be seen by examining the operation of
the tree during the period. We assumed earlier that the links connected to the outputs of the root router
of the tree accept packets very slowly. Thus, we assume that the root router accepts packets very slowly.
We expect the (i-1)st intermediate to accept packets no more quickly than the root router since the (i-1)st
intermediate is the slowest router of the (i-1)st stage from the root. Thus, we assume that the input
routers can supply packets quickly enough that the input buffers of the (i-1)st intermediate remain non
empty.

Since we have assumed that the input buffers of the (i-1)st intermediate remain non empty during
the period, the number of packets removed from each of the input buffers of the (i-1)st intermediate
during the period is independent of the tag bits examined by the input routers and is thus independent of
kmaxtpa, and these facts can be used to bound E[TPO_kmaxtpa], the expected number of packets
output during the period by the kmaxtpa-th input router that are tagged for the (i-1)st intermediate. The
expected number of packets removed during the period from the kmaxtpa-th input buffer of the (i-1)st
intermediate is (1/2)E[numbpackets_d^{i-1}]. However, the number of packets in the kmaxtpa-th input
buffer of the (i-1)st intermediate at the end of the period may depend on the tag bits examined by the
kmaxtpa-th input router during the period. Thus, the following relation holds for E[TPO_kmaxtpa],

    (1/2)E[numbpackets_d^{i-1}] - B <= E[TPO_kmaxtpa] <= (1/2)E[numbpackets_d^{i-1}] + B .    (93)

Based on the arguments of the previous paragraphs, we estimate a lower bound on
E[numbpackets_d^{i-1}] - E[numbpackets_d^i], the difference between the expected number of packets
accepted during the period by the (i-1)st intermediate and the expected number of packets accepted
during the period by the i-th intermediate. Clearly, E[TPA_kmaxtpa], the expected number of packets
accepted during the period by the kmaxtpa-th input router and tagged for the (i-1)st intermediate, is less
than or equal to E[TPO_kmaxtpa] + 2B. From (93) and (92), we have E[TPO_kmaxtpa] <=
(1/2)E[numbpackets_d^{i-1}] + B and E[f_kmaxtpa] > 1/2 + c_6 / (E[PA_kmaxtpa])^{1/2}. Assuming small
deviations from the means, we assume that E[PA_kmaxtpa] = E[TPA_kmaxtpa] / E[f_kmaxtpa]. Thus, we conclude
that

    E[PA_kmaxtpa] < E[TPA_kmaxtpa] / (1/2 + c_6 / (E[PA_kmaxtpa])^{1/2}) ,                    (94)

    E[PA_kmaxtpa] < ((1/2)E[numbpackets_d^{i-1}] + 3B)
                    / (1/2 + c_6 / ((1/2)E[numbpackets_d^{i-1}] + 3B)^{1/2}) ,                (95)

    E[PA_kmaxtpa] < E[numbpackets_d^{i-1}] + 6B
                    - 2 c_6 ((1/2)E[numbpackets_d^{i-1}] + 3B)^{1/2}
                      / (1/2 + c_6 / ((1/2)E[numbpackets_d^{i-1}] + 3B)^{1/2}) ,              (96)

and

    E[PA_kmaxtpa] < E[numbpackets_d^{i-1}] - c_7 (E[numbpackets_d^{i-1}])^{1/2}               (97)

for some positive constant c_7. Clearly, E[numbpackets_d^i], the expected number of packets accepted
during the period by the i-th intermediate, is no greater than the expected number of packets accepted
during the period by the slower input router (the slower of the two routers connected to the inputs of the
(i-1)st intermediate) and thus is no greater than E[PA_kmaxtpa]. Thus, E[numbpackets_d^{i-1}] -
E[numbpackets_d^i] > c_7 (E[numbpackets_d^{i-1}])^{1/2}. Since E[numbpackets_d^{i-1}] > E[numbpackets_d^i],

    E[numbpackets_d^{i-1}] - E[numbpackets_d^i] > c_7 (E[numbpackets_d^i])^{1/2} .            (98)

Based on the arguments of the previous paragraphs, we estimate a lower bound on
E[numbpackets_d^i], the expected number of packets accepted during the period by the i-th intermediate.
We use the notation estnpkts_d^i to refer to the estimate for a lower bound on E[numbpackets_d^i]. We
define estnpkts_d^i recursively. The basis comes from the fact that, by the definition of the period,
E[numbpackets_d^{d-1}] is equal to c_1 B^2. The recursive step comes from the discussion of the previous
paragraphs. Thus, we define

    estnpkts_d^{d-1} = c_1 B^2                                                          (99)

and for 0 < i < d,

    estnpkts_d^{i-1} = estnpkts_d^i + c_7 (estnpkts_d^i)^{1/2} .                        (100)

For very large values of d, estnpkts_d^0 grows roughly as the square of d; we are primarily interested
in values of d less than 10, but estnpkts_d^0 also grows rapidly for d in this range. Below, we assume
example values for c_1, c_7, and B, and compute estnpkts_d^0 for values of d less than 10. The results are
listed in Table IV.

We use estnpkts_d^0 to estimate a lower bound on E[T_d], the expected value of the slowest leaf
router acceptance time. Since the zeroth intermediate, the root router, is expected to accept
E[numbpackets_d^0] packets during the period, we assume that it is expected to output greater than
(1/2)(E[numbpackets_d^0] + c_8 (E[numbpackets_d^0])^{1/2}) packets on one of its outputs during the period for
some constant c_8. We assume that it takes (1/2)(1/c_2)(E[numbpackets_d^0] + c_8 (E[numbpackets_d^0])^{1/2})
time for the probabilistic sink connected to that output to accept the packets where c_2 is the parameter of
the probabilistic sink. We use the notation estT_d to refer to the estimate for a lower bound on E[T_d].
We define

    estT_d = (1/2)(1/c_2)(estnpkts_d^0 + c_8 (estnpkts_d^0)^{1/2}) .                    (101)

As we shall see in the example below, estT_d, our estimated lower bound on the time required for
the slowest leaf router to accept a total of c_1 B^2 packets, grows rapidly with d. As we shall see, estT_d for
d equal to nine may be several times the size of estT_d for d equal to one.
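
The recursive definition in (99)-(101) is straightforward to evaluate. The sketch below does so with the example values c_1 B^2 = 25, c_7 = c_8 = 0.76, and c_2 = 0.35 used in the text; the function names `estnpkts_root` and `estT` are introduced here only for illustration.

```python
from math import sqrt


def estnpkts_root(d, c1B2=25.0, c7=0.76):
    """estnpkts_d^0 from (99)-(100).

    Start from the leaf stage, i = d-1, with the value c_1 B^2 and apply the
    recursive step d-1 times to reach the root, i = 0.
    """
    est = c1B2
    for _ in range(d - 1):
        est = est + c7 * sqrt(est)
    return est


def estT(d, c1B2=25.0, c7=0.76, c8=0.76, c2=0.35):
    """estT_d from (101)."""
    root = estnpkts_root(d, c1B2, c7)
    return 0.5 * (1.0 / c2) * (root + c8 * sqrt(root))


for d in range(1, 10):
    print(d, round(estnpkts_root(d), 1), round(estT(d), 1))
```

Each step adds a term proportional to the square root of the current value, which is why the estimate grows roughly as the square of d for large d.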

Evaluation of the Model

We use simulation to evaluate how well the behavior of a special tree corresponds to our estimates,
and how well the behavior of a tree in an indirect n-cube network of modest size corresponds to the
behavior of a special tree. As is discussed below, our simulations suggest that the time required by the
slowest leaf router of a d-stage special to accept c_1 B^2 packets grows at least as fast as estT_d defined
above (101), and our simulations suggest that the slow leaf routers of a tree in an indirect n-cube network
of modest size are at least as slow as the slow leaf routers of a special tree of corresponding size.

We simulated special trees of various depths and examined the slowest leaf router of each tree in
order to compare its behavior to our estimates. In the simulation, the size of the buffers, B, was equal to
five. c_2, the parameter of the probabilistic sinks, was set to 0.35. A simulation run was made for each
tree depth. Each simulation run was divided into many periods. Each leaf router accepted more than 25
packets during each period. For each period of each simulation run, the time required by the slowest leaf
router to accept a total of 25 packets was measured. The average over all the periods of a simulation run
was computed. The results of the simulation runs are shown in Table IV. For each simulation run, the
average time for the slowest leaf router and the depth of the tree, d, are listed.

Table IV. Evaluation of estT_d

d                      1      2      3      4      5      6      7      8      9      10

T_d from simulation    36.3   45.8   51.1   61.1   79.5   88.5   94.1   112.2

estT_d                 41.1   47.0   53.2   59.8   66.8   74.3   82.1   90.3   99.0

estnpkts_d^0           25.0   28.8   32.9   37.2   41.9   46.8   52.0   57.5   63.2   69.3

B = 5, c_1 B^2 = 25, c_7 = .76, c_2 = .35, c_8 = .76

For comparison with the special tree simulation results, we computed estT_d and estnpkts_d^0 for
various values of d. We assumed that c_7 was equal to c_8 and that c_8 was equal to 0.76 and we assumed
that estnpkts_d^{d-1} was equal to 25. The computed values are listed in Table IV. The simulation results
seem to grow at least as fast as estT_d.

We simulated complete indirect n-cube networks of modest size in order to compare their slow
input routers to the slow leaf routers of special trees. Networks with buffers of size one, two, and five
were simulated. In the simulation, the outputs of the networks were connected to perfect sinks. A
simulation run was made for each combination of network depth and buffer size. For each buffer size, B,
a value of c_1 was selected. Each simulation run was divided into many periods such that each input
router accepted more than c_1 B^2 packets in each period. For each period of each simulation run, the time
required by the slowest input router to accept a total of c_1 B^2 packets was measured. The average over all
the periods of a simulation run was computed. The results of the simulation runs are shown in Table V.

For comparison with the complete networks, we performed additional simulation of special trees.
We simulated special trees with the same buffer sizes as the complete networks. We simulated a special
tree for each combination of depth and buffer size. For each special tree simulated, if d was the depth of
the special then we set the parameters of the special to correspond to the parameters of a (d+4)-stage
network with the same buffer size and with its outputs connected to perfect sinks. We did this in order to
evaluate how well the d-stage special represented d-stage trees in the first d stages of a (d+4)-stage
network as shown in Figure 21. As was discussed earlier, the behavior of a special tree is intended to
represent the behavior of a tree in the first stages of a complete network. We chose to compare the
d-stage special to d-stage trees of a (d+4)-stage complete network. While the exact choice of d+4 was
not critical, it was important to consider a network that corresponded to the assumptions of the special
trees. In particular, it was important to consider a network large enough that the rate at which packets
were accepted from the roots of the d-stage trees in the first d stages of the network was low. The value
of c_1 used for the special was equal to the value of c_1 used for the complete network. The value of c_2, the
parameter of the probabilistic sinks of the special, was selected to roughly correspond to the rate at which
packets could be accepted by an input of a router of the third from the last stage of the complete network.
We estimated this rate by considering a four stage complete network with the same buffer size and with
its outputs connected to perfect sinks, and by measuring the average rate at which its inputs could accept
packets. A simulation run was made for each special tree. Each simulation run was divided into many
periods such that if B was the buffer size, each leaf router accepted more than c_1 B^2 packets in each
period. For each period of each simulation run, the time required by the slowest leaf router to accept a
total of c_1 B^2 packets was measured. The average over all the periods of a simulation run was computed.
The results are listed in Table V. The simulation results for the special trees do not seem to grow any
faster than the simulation results for the corresponding complete networks.

Table V. Evaluation of the Special Tree Model

[Table entries not recovered; rows are indexed by network depth n = d+4, tree depth d, and buffer size B.]

where T'_slow is the time required by the slowest input router of an indirect n-cube network
             to accept a total of c_1 B^2 packets
      IR is the average total input rate of an indirect n-cube network
  and T_slow is the time required by the slowest leaf router of a d-stage special
             to accept a total of c_1 B^2 packets

Fig. 21. d-Stage Tree in a (d+4)-Stage Network

Table V also lists simulation results that can be used to compare the input rate of the slowest input
router of an indirect n-cube network to the total input rate of the network. For each simulation run, the
table lists c_1 B^2 divided by twice the average, over all of the periods of the run, of the time required by
the slowest input router to accept c_1 B^2 packets. This quotient gives an indication of the rate at which the
slowest input router accepts packets on each of its inputs. For each simulation run, the table also lists the
average total input rate of the indirect n-cube network divided by the number of network inputs. These
results suggest that the input rate of the slowest input of an indirect n-cube network can be several times
slower than the expected input rate of a randomly selected input.

Trees in Very Large Networks 

In very large indirect n-cube networks, n much larger than 10, the interaction between a large tree
and the rest of the network does not correspond to the assumptions that we made for modest trees in
modest networks, n less than 9. In a very large tree, as shown in Figure 22, conflict in the earlier stages of
the tree causes routers in the later stages to receive packets very slowly. The rate at which a router in one
of the later stages of the tree receives packets on its inputs is not likely to be larger than the rate at which
packets can be accepted from its outputs. The input buffers of such a router may often be empty. As a
result, the conclusions that we drew about the slow leaf routers of modest trees in modest networks do not
hold for the slow leaf routers of very large trees in very large networks. For d much larger than 10, we do
not expect the slowest leaf router of a d-stage tree in an indirect n-cube network to require estT_d time to
accept c_1 B^2 packets. We do not expect such a router to require time proportional to d^2 to accept c_1 B^2
packets and estT_d is proportional to d^2.

However, the time required by the slowest input router of a very large network to accept a constant
number of packets is a function of the network's depth. In particular, if the depth of the network is n, we
expect the slowest input router of the network to require greater than c_9 n/(log_2 n) time to accept 3B packets
for some constant c_9.

Fig. 22. d-Stage Tree



The motivation for this can be seen by considering the network and its operation. For the purpose
of discussion, we define

$K = \frac{c_9 n}{B \log_2 n}$.

We consider the routers in the $(\log_2 K)-1$st stage from the network inputs. For the purpose of
discussion, we refer to the $(\log_2 K)-1$st stage from the network inputs as the selected stage. The network
inputs can be divided into (N/K) groups with K inputs in each group such that the set of routers of the
selected stage that can receive packets from one group of inputs is disjoint from the set of routers of the
selected stage that can receive packets from any other group of inputs.

We consider the operation of the network after some randomly chosen point in time and argue that
with high probability at least one of the inputs accepts less than $c_7 B$ packets in a period of $c_9 n / \log_2 n$ units
of time after the selected point. We define P to be the chance that the first 3BK packets to arrive on a
group of inputs must pass on the same output of the same router of the selected stage. Thus,

$P = K (1/K)^{3BK}$, (103)

$P > (1/K)^{3BK}$, (104)

and

$P > \left(\frac{B \log_2 n}{c_9 n}\right)^{3BK} = \left(\frac{B \log_2 n}{c_9 n}\right)^{3 c_9 n / \log_2 n}$. (105)

We define P' to be equal to the chance that, for at least one of the groups of inputs, the first 3BK packets
received by that group are tagged for the same output of the same router of the selected stage. Thus,

$P' = 1 - (1-P)^{N/K}$. (106)

Using the Taylor series for log(1-P),

$P' = 1 - e^{-(N/K) \sum_{i=1}^{\infty} (1/i) P^i}$, (107)

$P' > 1 - e^{-(N/K) P}$, (108)

and

$P' > 1 - e^{-p_{exp}}$ (109)

where $p_{exp}$ is equal to (N/K)P. Thus,

$p_{exp} = (N/K) P > N^{1-3c_9} \left(\frac{B \log_2 n}{c_9}\right)^{3BK} (1/K)$. (110)

If $c_9$ is much less than 1/3 and N is large then there is a good chance that for at least one group of inputs,
all of the first 3BK packets received by that group are tagged for the same output of the same router of
the selected stage. In such a case, since only B(2K-2) packets can be buffered between that group of
inputs and the output of the router of the selected stage, the operation of the router of the selected stage
affects that group of inputs. Since there are K inputs in the group and since only one packet can be
transferred per unit time on the output of the router of the selected stage, at least one of the inputs will

accept less than 3B packets in a period of BK units of time. In other words, at least one of the inputs
will accept less than 3B packets in a period of $c_9 n / \log_2 n$ units of time. Thus, if $c_7 B$ is greater than three
then at least one of the inputs will accept less than $c_7 B$ packets in a period of $c_9 n / \log_2 n$ units of time.
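
The quantities in this argument are too extreme to evaluate directly for large n, so it can help to work with base-2 logarithms. The following is only a sketch: it assumes N = 2^n for an indirect n-cube network, treats K as a real number, and uses hypothetical function names. It evaluates the logarithm of the lower bound on the expected number of congested groups in two ways, once from (N/K)(1/K)^{3BK} and once from the closed form in N^{1-3c_9}, as a consistency check.

```python
import math

def group_size(n, B, c9):
    """K = c9 * n / (B * log2(n)), treated as a real number in this sketch."""
    return c9 * n / (B * math.log2(n))

def log2_p_exp_direct(n, B, c9):
    """log2 of (N/K) * (1/K)**(3*B*K), assuming N = 2**n."""
    K = group_size(n, B, c9)
    return n - math.log2(K) - 3 * B * K * math.log2(K)

def log2_p_exp_closed(n, B, c9):
    """log2 of N**(1 - 3*c9) * (B*log2(n)/c9)**(3*B*K) * (1/K)."""
    K = group_size(n, B, c9)
    return ((1 - 3 * c9) * n
            + 3 * B * K * math.log2(B * math.log2(n) / c9)
            - math.log2(K))
```

A positive logarithm means the bound on the expected number of groups whose first 3BK packets share one output exceeds one, which is the regime the argument relies on.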

2.3 Networks for Systems with Localized Communication 

Many localized communication patterns can be supported with networks that are less complex than
the uniform communication networks described above. There are, of course, a wide variety of localized 
communication patterns. While we have not done extensive work on this topic, we describe in this 
section an obvious family of network structures that seem appropriate for some important localized 
communication patterns. 

In the technologies of this chapter, there is a large class of systems such that each system can be 
supported by a network that has a cost linear in its number of inputs. Since we are assuming in this
chapter that all wires have the same cost independent of their length, the factors affecting the cost of a
network are simply the number and complexity of network nodes, and the number of interconnecting
wires. If each source of packets in a given system generates packets for a number of destinations that is
less than c, where c is some constant independent of the total number of sources, then the communication
requirements of the system can be supported by a network with a number of nodes and wires 
proportional to the number of inputs. Such a network can be constructed in the obvious way by 
associating a node with each input and a node with each output, and connecting each input node to the 
output nodes that correspond to possible destinations for packets from that input. The total cost of such a 
network is independent of the identity of the output nodes that must be connected to a given input node. 
It is important to note that this will not be the case in the technologies of the next chapter. 
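
The construction described above can be sketched in a few lines. This is a minimal illustration, not a design from the thesis: the function name and the node labels `("in", i)` and `("out", d)` are hypothetical, and the destination sets are supplied by the caller.

```python
def build_fanout_network(destinations):
    """destinations[i] lists the outputs that input i may address.

    Returns (nodes, wires): one node per input, one node per addressed
    output, and one wire from each input node to each of its destinations.
    """
    input_nodes = [("in", i) for i in range(len(destinations))]
    output_ids = sorted({d for dests in destinations for d in dests})
    output_nodes = [("out", d) for d in output_ids]
    wires = [(("in", i), ("out", d))
             for i, dests in enumerate(destinations)
             for d in sorted(set(dests))]
    return input_nodes + output_nodes, wires
```

If every input has at most c destinations, the wire list has at most c times as many entries as there are inputs, so both counts grow linearly with the number of inputs.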

One linear cost network structure is the grid structure. We consider a grid of two dimensions as
shown in Figure 23, but grids of higher dimensions are also useful. Each node of the grid is connected to
the nodes adjacent to it. Each node is also connected to a network input and a network output. Clearly,
the cost of such a network is linear in the number of network inputs. Such a network can obviously be
used in systems that support computations on grid structured data such that the computation on a given
grid element involves only the adjacent grid elements.
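
As a rough check of the linear cost claim, the wires of a two dimensional grid can be enumerated directly. The sketch below assumes a rectangular grid without wraparound connections; the function name is hypothetical.

```python
def grid_wires(rows, cols):
    """List the wires between adjacent nodes of a rows x cols grid."""
    wires = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:          # wire to the neighbor on the right
                wires.append(((r, c), (r, c + 1)))
            if r + 1 < rows:          # wire to the neighbor below
                wires.append(((r, c), (r + 1, c)))
    return wires
```

The number of wires is 2·rows·cols − rows − cols, which is less than twice the number of nodes, so nodes and wires together grow linearly with the number of network inputs.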

Another linear cost network structure is the tree structure. In such a structure, the nodes are 
connected in a tree as shown in Figure 24. The inputs and outputs of the network can be connected in at 
least two possible ways. One way is to connect each node of the network to a network input and a 
network output. Another is to connect only the leaf nodes to the network inputs and outputs. In the 
discussion below, we assume that each node of the network is connected to a network input and a 
network output. 

A tree network can be used to support applications, such as divide and conquer algorithms, that
require hierarchical communication patterns. A tree network can simultaneously support communication
between the PC's in each pair of adjacent leaf PC's. Thus, it can support a high total bandwidth for such
communication. But the network has half that bandwidth for supporting communication that must go
through the PC's of the second stage from the leaves. In general, if for some i the network has some
bandwidth for packets that must go through the ith stage then it has half that bandwidth for packets that
must go through the (i+1)st stage. Thus, the tree network can be used to support some systems that
require hierarchical communication. For example, the network can obviously be used in a tree structured
system where each module is the root of a subtree of modules that it controls and where each module
requires the same communication bandwidth.

Fig. 23. Grid Structure

Fig. 24. Tree Structure





3. Design of Routing Networks Considering the Cost of Wires 

3.1 Introduction 

In this chapter, we examine the changes that will probably occur in integrated circuit technology in 
the next five to ten years, and examine the design of routing networks assuming such changes. We refer 
to the resulting technology as very large scale integration (VLSI). 

In the next five to ten years, we expect several improvements in integrated circuit technology. We 
expect a reduction in the width of wires and the size of transistors. The minimum width of wires may
shrink to roughly 1/(2) to 1/(2.4) of its present value. The area required at the end of this period to
implement a circuit may be 1/2 to 1/6 of the area required at present to implement the circuit. We
expect the speed of on-chip circuits to increase to roughly two to four times their present speed. We also
expect that by the end of this period the use of multiple layers of metal will be common.

While we expect features on a chip to become smaller and the complexity of on-chip circuits to 
increase, we do not expect the overall physical size of the chips to increase drastically in this period. 

The improvements in integrated circuit technology will allow a large number of network nodes to 
be placed on a single chip. As a result, the wires interconnecting the nodes will be on-chip wires. Since 
the chip area required to implement an on-chip wire is proportional to its length, the length of wires in a 
network subsection will be an important factor in the chip area required to implement the subsection. 

However, it appears that for the next five to ten years it will still be possible to drive even very long 
wires quickly by choosing drivers of the appropriate size. The capacitance to the substrate per square
micron of metal will increase as the thickness of the oxide layers decreases. We expect that the width of
wires will in general decrease and that the area of a given length of wire will decrease. The combination
of increasing capacitance per unit area and decreasing area may cause the capacitance of a given length of
wire to remain roughly constant. A long wire with a large capacitance can be driven quickly if a large
driver capable of driving a large amount of current is used. Presently, even a cross chip wire can be 
driven in time comparable to the delay of about ten logic stages by using a transistor whose area is a few 
times that of a minimum size transistor. It is not clear exactly how the area required to implement a high 
current transistor will change as technology changes. The current driving capacity of a transistor is 
inversely proportional to the resistance of its channel. The transistor channel resistance for a particular 
ratio of length to width may increase. But since the minimum channel length will decrease, the minimum 
area for a transistor with a certain current driving capacity may decrease. It seems likely that for the next 
five to ten years it will still be possible to drive a very long wire in reasonable time with a transistor that is 
quite small in comparison to the area of the wire. 

In this chapter, we first describe a model of VLSI that we will later use to study the area required to
implement network structures in VLSI. We expect that the dominant component of the total area 
required to implement a network in VLSI will be the area required to implement its wires. The features 
of the wires of the model correspond to what we have assumed above will be the primary characteristics 
of wires in VLSI. The wires of the model require an area proportional to their length. They have no 
propagation time. A signal can be asserted on a wire of the model in unit time. 

We then examine in this model of technology the design of networks for uniform communication 
applications. We examine in the VLSI model the fundamental cost of a single chip network to support a 
certain level of performance for uniform communication applications. We examine a few structures that
seem appropriate for implementing a single chip uniform communication network in VLSI. These
structures include a crossbar structure, and an indirect n-cube structure. We discuss a technique for
interconnecting single chip networks to form larger networks.

We also briefly examine networks for localized communication applications. We examine a few
example network structures and describe the communication patterns that they can support.

3.2 VLSI Model

The model presented here is for the most part the model suggested by Thompson [25]. The items
implemented on the chip can be broken into two primary types, processing centers (PC's) and wires. All
processing occurs at the PC's. Transmission of information among the PC's is performed using the wires.
All switching functions are performed by the PC's.

It should be noted that the concept of unit time that we use in the VLSI model is different from the
concept of unit time that we used in the previous chapter. In the previous chapter, a unit of time was the 
time required to transfer a single packet on a link. In the VLSI model, a unit of time is the time required 
to transfer a single bit of information on a wire. 

Each wire interconnects some number of PC's. A wire has unit width, and a signal (one bit of 
information) may be asserted on the entire length of a wire in unit time. The model characterizes a wire 
as a lumped capacitive and resistive load with a rise time but with no propagation time. The model allows
multiple connections to a single wire. It should be noted that the area charged in the model for such
connections may be too small. The primary reason for this comes from the fact that the model assumes
that in a VLSI implementation the total area required for all of the drivers of a wire is comparable to, or is
less than, the area of a wire. In the case of a wire with only one or two drivers, the area required to
implement the drivers would be quite small in comparison to the area of the wire. However, in the case
of a wire with a number of drivers proportional to its length, it is likely that the total area required to
implement the drivers would be at least of the same order as the area of the wire. Thus, this model can be
used to obtain lower bounds on the area required to implement a circuit, but the area required for wire
drivers must be considered carefully in the actual implementation of any circuit requiring a large number
of connections to a single wire.

Each PC (processing center) is connected to some number of wires. In this model, we assume that a
PC is square and that the area required to implement a PC is at least as great as $h^2$ where $h$ is the number
of wires connected to the center. This area is of the same order as that required to construct a complete
cross point switch for switching among the wires. If a center does not need such a powerful switching
capability, it should be decomposed into smaller centers. We also assume that information cannot pass
through a node in less than one unit of time.

We assume that each piece of input data is available on-chip from a specialized input PC and that
each piece of output data need only be delivered to an on-chip output PC. We have chosen to separate
this model from the problem of getting information into and out of a chip. The input and output capacity
of VLSI chips depends critically on the technology used to package the chips. It is not presently clear
which technologies will be used as the scale of integration increases. This topic deserves further study as
the packaging technology advances.

3.3 Networks for Systems with Uniform Communication 

3.3.1 Wire Cost 

In this subsection, we investigate the wire area required by single chip networks capable of high
performance in systems with uniform communication. In particular, we obtain for 1 > f > 0 a lower
bound on the area required in the VLSI model by the wires of any single chip N-input N-output routing
network capable of supporting an average throughput of fN packets per unit time when its inputs are
connected to the uniform communication model sources and its outputs are connected to the non
blocking model receivers. For this study, each model source is assumed to produce a new packet within
one time unit of the network's acceptance of the source's previous packet. We assume that the label of
each packet produced by a model source is independently selected, and that all of the possible destination
labels are equally likely.



Proposition. For 1 > f > 0, $\Omega((fN)^2)$ area is required in the VLSI model to implement any single chip
N-input N-output routing network capable of supporting an average throughput of fN packets per unit
time when its inputs are connected to the uniform communication model sources and its outputs are
connected to the non blocking model receivers.

Proof. To get the desired lower bound for routing networks, we use an approach similar to that used by
Thompson [25] for the discrete Fourier transform and by Abelson [1] for multiplication. This approach
uses a concept called the minimum bisection width, which we will define in terms of the graph of a VLSI
circuit. For any circuit in our VLSI model, the graph of the circuit is defined to be G = (V,E) where V
contains a vertex for each of the PC's in the circuit, and E is the set of all sets {x,y} such that x and y are
contained in V and a wire exists between the two PC's corresponding to x and y. The minimum bisection
width of the circuit is defined to be the smallest b such that for some partition of V into $H_1$ and $H_2$ with
$|H_1| \le |H_2| \le |H_1| + 1$, the deletion of b edges from E can disconnect $H_1$ from $H_2$. The minimum
bisection width of a subgraph is similarly defined. If U is a subset of V for some graph G = (V,E), then
the minimum bisection width of U in G is defined to be the smallest b such that for some partition of U
into $H_1$ and $H_2$ with $|H_1| \le |H_2| \le |H_1| + 1$, the deletion of b edges from E can disconnect $H_1$ from
$H_2$. Thompson has shown that if the minimum bisection width of some subset of the graph of a VLSI
circuit is b, then the area required in the VLSI model for the circuit's wires and PC's is greater than $b^2/4$.
Thus, if we can develop a lower bound on the minimum bisection width of any VLSI circuit capable of
performing a particular function in a given period of time, we can deduce a lower bound on the area
required by any such circuit.
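
For very small graphs, the definition of minimum bisection width can be checked by exhaustive search. The sketch below is illustrative only and is exponential in the number of vertices; the function names are hypothetical. It tries every assignment of vertices to two sides, keeps those that split the chosen subset U in a balanced way, and counts the edges that cross between the sides. The second function applies the $b^2/4$ area bound attributed to Thompson in the text.

```python
from itertools import product

def min_bisection_width(vertices, edges, subset=None):
    """Minimum bisection width of `subset` (default: all vertices) in the graph.

    Vertices outside `subset` may be placed on either side, so the count
    is the fewest edges whose deletion disconnects H1 from H2.
    """
    vertices = list(vertices)
    subset = set(vertices if subset is None else subset)
    best = None
    for sides in product((0, 1), repeat=len(vertices)):
        side = dict(zip(vertices, sides))
        h1 = sum(1 for v in subset if side[v] == 0)
        h2 = len(subset) - h1
        if not (h1 <= h2 <= h1 + 1):      # require |H1| <= |H2| <= |H1| + 1
            continue
        cut = sum(1 for x, y in edges if side[x] != side[y])
        if best is None or cut < best:
            best = cut
    return best

def thompson_area_bound(b):
    """Area lower bound b^2 / 4 for bisection width b."""
    return b * b / 4
```

For example, a ring of four vertices can be bisected by deleting two edges, while a star whose four leaves form the subset U also needs two deletions because the center must fall on one side.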

A lower bound on the minimum bisection width of any single chip VLSI implementation of any
N-input N-output routing network with the desired characteristics can be established by examining the
communication needs of such a network. Let us consider the graph, G = (V,E), of any such VLSI
implementation. We assume that the VLSI implementation has N input PC's and N output PC's. We
assume that each input PC can produce packets as fast as the network can accept them and each output
PC can accept packets as fast as the network can present them. We will examine the case for even N. A
similar approach can be used for odd N by ignoring one input PC and one output PC. Let O be the
subset of V corresponding to the N output PC's, and I be the subset of V corresponding to the N input
PC's. If the minimum bisection width of O in G is b, then there must be a set of b edges whose removal
would cause one half of the vertices in O to be disconnected from the other half. Let $O_1$ and $O_2$ be the
two disconnected subsets of O that would result from the bisection where $|O_1| = |O_2|$. Let $I_1$ be the set
of all vertices in I that would remain connected to any vertex in $O_1$ after bisection, and $I_2$ be the set of all
vertices in I that would remain connected to any vertex in $O_2$. $I_1$ and $I_2$ must be disjoint since by
definition the bisection disconnects $O_1$ and $O_2$. It follows from our previous assumption regarding the
average throughput of the network, that in some very long period of duration T the network must be able
to accept at least fNT packets. The characteristics of the model sources imply that the expected number
of packets received during such a period that must be routed either from inputs in $I_1$ to outputs in $O_2$ or
from inputs in $I_2$ to outputs in $O_1$ is greater than or equal to fNT/2. Since each wire can transmit only
one bit per unit time, there must be at least fN/2 wires corresponding to the edges in the bisection.
Therefore b, the number of edges in the minimal bisection of O in G, must be at least fN/2.

Based on these results and Thompson's theorem it follows that the area required in the VLSI model
to implement any N-input N-output routing network with the capacity to support an average throughput
of fN packets per unit time in the uniform communication model application is $\Omega((fN)^2)$. In other
words, there exists a constant c such that the area is greater than or equal to $c(fN)^2$.

This ends the discussion of the proposition.
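
The two steps of the argument can be combined into one chain of inequalities. The following is a worked sketch that uses only the bisection bound of fN/2 edges and the $b^2/4$ area bound quoted from Thompson; the symbol A for the area is introduced here for illustration.

```latex
A \;>\; \frac{b^2}{4} \;\ge\; \frac{1}{4}\left(\frac{fN}{2}\right)^{2} \;=\; \frac{(fN)^2}{16}
```

Under these assumptions the constant c in the proposition can be taken to be 1/16.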

If we make certain additional assumptions about the network, we can obtain a more detailed lower
bound on the area required to implement the network in the VLSI model. In particular, if we assume
that there are at least p bits in each packet, and if we continue to assume that the VLSI implementation
has N input PC's and N output PC's, then we can show that $\Omega((pfN)^2)$ area is required in the VLSI
model to implement the network. The argument is similar to the one used above.

Thus under these assumptions, p and N have the same effect on our lower bound. 

As we shall see later, some of the networks that we present come close to this bound. In particular,
they can support a throughput of $\Omega(N w / (p \log w))$ packets per unit time and require $O((wN)^2)$ area in the
VLSI model where p is the number of bits in each packet and w is the number of wires in each link of the
network. Thus, the area of the networks differs from the lower bound by a factor of $\log^2 w$. This factor is
a result of the fact that we assume that a network node requires log w time to receive and acknowledge a
group of w bits where one bit comes from each of the w wires of a link.

In considering these results, it should be remembered that the nature of the communication 
networks required for a particular system depends on the overall design of the system. One important 
issue in the design of a large system is the decomposition of the system into subsystems such that each 
subsystem can be implemented on a single chip in VLSI. A number of factors affect the decomposition 
of the system. The nature of the modules to be interconnected by a network affects the chip area 
required to implement them, and it affects the feasibility of implementing them on the same chip as the 
network. The maximum area of a chip and the maximum number of pins on a chip limit the complexity 
of and the communication bandwidth of a single chip subsystem. 

The decomposition of a system affects the characteristics of the communication networks required
by the system and thus affects the cost of those networks. To illustrate this, we consider a system with two
possible multiple chip implementations. We refer to these two implementations as implementation_1 and
implementation_2. Implementation_1 is composed of N single chip modules interconnected by w single chip
uniform communication networks. In implementation_1, we assume that each of the single chip modules
has w inputs of one wire each and w outputs of one wire each, and we assume that each of the networks
has N inputs of one wire each and N outputs of one wire each. Implementation_2 is composed of N single
chip modules interconnected by a single uniform communication network. In implementation_2, we
assume that each of the single chip modules has one input of w wires and one output of w wires, and we
assume that the network has N inputs of w wires each and N outputs of w wires each. The lower bounds
above suggest that an N-input N-output uniform communication network with w-wire data paths requires
more area in the VLSI model than an N-input N-output uniform communication network with single wire
data paths. In particular, a network with w-wire data paths requires $w^2$ times as much area as a network
with single wire data paths. Thus, the total area in the VLSI model required for the w networks of
implementation_1 is less than the area required for the network of implementation_2 by a factor of w.
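
The factor of w can be seen from a short calculation. This is a sketch under an explicit assumption that is not stated in this form in the text: an N-input network with single wire data paths occupies area $aN^2$ for some constant a, and the same structure with w-wire data paths occupies $a(wN)^2$. The symbols a, $A_1$ and $A_2$ are introduced here for illustration.

```latex
A_1 = w \cdot aN^2, \qquad
A_2 = a(wN)^2 = w^2 \cdot aN^2, \qquad
\frac{A_2}{A_1} = w
```

Here $A_1$ is the total area of the w single wire networks and $A_2$ is the area of the one network with w-wire data paths.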

In order to further demonstrate the importance of the decomposition of a system into single chip
subsystems, we consider a third implementation of the system of the previous paragraph. We refer to this
implementation as implementation_3. Implementation_3 is similar to implementation_1 except that all of the
components of implementation_3 are placed on the same chip. In particular, implementation_3 is a single
chip composed of N modules interconnected by w uniform communication networks. In
implementation_3, each of the modules has w single wire inputs and w single wire outputs, and each of the
networks has N single wire inputs and N single wire outputs. In the VLSI model, it can be shown that
implementation_3 requires area of the same order as the communication network of implementation_2.
Thus, implementation_3 requires w times as much chip area as that required by the networks of
implementation_1. The discrepancy is due to the fact that some of the interconnections that are
accomplished by on chip wires in implementation_3 are accomplished by off chip wires in implementation_1.
In particular, the connections in implementation_3 between the modules and the networks require
$\Omega((Nw)^2)$ chip area but the corresponding connections in implementation_1 are accomplished by off chip
wires.

Thus, we have shown that the data paths of a single chip uniform communication network imply a
certain lower bound on the area required by the network in the VLSI model. In particular, $\Omega((pfN)^2)$
area is required in the VLSI model to implement any single chip N-input N-output routing network with
an average throughput of fN p-bit packets per unit time in the uniform communication model
application. However, in considering this lower bound it should be remembered that the decomposition
of a system into single chip subsystems affects the characteristics of the networks required by the system
and thus the cost of those networks.

3.3.2 Network Structures 

Introduction 

A number of structures exist for single chip routing networks that are capable of high performance
for uniform communication applications. These include a simple structure similar to the standard
crossbar switch as well as the indirect n-cube structure. While the overall areas required by these
structures in the VLSI model are similar, many of the other characteristics of these structures differ
greatly. For example, the N-input crossbar network uses wires with N connections, but the N-input
indirect n-cube network uses only wires with two connections. We examine a few network structures and
examine the characteristics of each structure that affect its VLSI implementation.

We assume that w-bit wide data paths are used for each network structure. For each structure, as
we shall see, the width w of the data paths results in a $w^2$ factor in the area required to implement the
structure. The area required for each of the network structures is $O((wN)^2)$ where w is the number of
wires in each link.

Thus for each of these structures, much less area is required in the VLSI model for w networks with
single wire data paths than for a single network with w-wire data paths.

Crossbar Structure

The first network structure that we examine is similar to the standard crossbar switch. We describe
the structure, discuss its complexity, and discuss two of its drawbacks. These drawbacks are the need for
many drivers for each long bus and the need for bus arbitration.

An N-input N-output network built according to this structure is composed of $N^2$ PC's arranged in a
grid with an additional N input PC's and N output PC's as shown in Figure 25. Each of the PC's in a
given row is connected to w wires associated with that row. Similarly, each of the PC's in a given column
is connected to w wires associated with that column. Each row of the grid is associated with one of the
network inputs. Each column of the grid is associated with one of the network outputs. The input PC's
are connected to the network inputs, and the output PC's are connected to the network outputs. A packet
entering the network is initially stored in an input PC. The destination label and data for one packet are
transmitted by the input node across one row of the grid. Each PC in the grid is capable of determining if
its associated output column is idle and if its associated input row is presenting data for that output
column. In such a case, the grid PC connects the input row to the output column, and the output PC
copies the presented data. Once the output PC has safely stored the packet, the grid PC will terminate its
connection, and the input PC will present its next packet. Each output PC passes the packets it has
received over the network output associated with it.

Fig. 25. Crossbar Network
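
The connection rule for the grid PC's can be sketched as a single step of a simulation. This is an illustration under stated assumptions rather than the design itself: the function name and argument layout are hypothetical, and ties between rows that present packets for the same idle column are broken in favor of the lowest numbered row, a policy the text leaves to the mutual exclusion mechanism.

```python
def crossbar_step(pending, column_busy):
    """One connection step of an N-input crossbar.

    pending[row] is the destination column of the packet presented on
    that row, or None if the row is idle. column_busy[col] is True while
    output PC col is still storing an earlier packet.
    Returns {column: row} for the connections made in this step.
    """
    connections = {}
    for row, dest in enumerate(pending):
        if dest is None:
            continue                      # nothing presented on this row
        if column_busy[dest] or dest in connections:
            continue                      # column not idle; the row keeps waiting
        connections[dest] = row           # grid PC (row, dest) connects row to column
    return connections
```

Rows that are not connected in a step simply keep presenting the same packet, which matches the behavior of an input PC that waits until its packet has been stored by the output PC.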

Obviously, ($N^2 + 2N$) PC's are required for an N-input N-output network built according to this
structure. Further, if the area required for each PC is independent of the size of the network, then the
overall network layout can be done in the grid like fashion we have described using a total area which is
proportional to $(Nw)^2$ where w is the number of wires in each data path.

There are some problems associated with the implementation of very large networks that have this
structure. The first comes from the fact that each output column wire can be driven by any of the N PC's
in the column. For the reasons that we discussed earlier, the total area required for the drivers of a
column wire in an actual implementation may be larger than the area of the wire. The area required to
implement an N-input N-output crossbar network may grow faster than $N^2$. However in the technology of
the next five to ten years, we expect that the growth will be close to $N^2$.

In addition, there are problems associated with the control of the various grid PC's. Only one grid
PC in a given column should be allowed to drive the column at a given time. One way to accomplish this
is to view the column as a synchronous bus and to use a grant signal that is daisy chained through the
PC's of the column. Unfortunately, there are problems with this scheme. This scheme requires that a
clock signal be distributed to all the PC's of the column. This scheme can only implement a fixed priority
of inputs for the column. Further, the grant signal of this scheme is quite slow since it has to go through
the control circuitry of each of the N + 1 PC's of the column. In the VLSI model, $\Omega(N)$ time is required
for the grant signal to reach the lowest priority input row. This time would seem to be a serious problem
since we would like each input row of the network to be capable of transmitting packets at a high rate.
However in a real implementation, the speed of the control circuits in each PC may be much greater than
the speed of an input row driver. For example, the grant signal may be propagated through a PC in one
tenth of the time required for a signal to be asserted on an input row wire. For networks of the size that
we expect in the next five years, networks with perhaps 64 to 128 bit serial inputs, the time required to
chain a grant signal through all the PC's of a column may be comparable to the time required to send bit
serially the destination address of a packet along an input row. In such a case, it may be feasible to
implement a crossbar network with output columns with daisy chained control wires.
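
The fixed priority and the linear delay of the daisy chained grant can both be seen in a small sketch. The function name is hypothetical, and the model assumes one unit of delay per PC that the grant signal passes through, with PC 0 closest to the source of the grant.

```python
def daisy_chain_grant(requests):
    """Pass a grant down a column of PC's until one with a request takes it.

    requests[i] is True if PC i wants to drive the column.
    Returns (winner, hops): the index of the PC that keeps the grant (or
    None if no PC is requesting) and the number of PC's the grant visited.
    """
    for hops, requesting in enumerate(requests, start=1):
        if requesting:
            return hops - 1, hops
    return None, len(requests)
```

The PC nearest the grant source always wins, and the number of hops grows with the position of the winner, which is the linear delay discussed above.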

Another possible technique for obtaining the mutual exclusion among the grid PC's connected to a
column uses an N-input tree of arbitration units (Figure 26) for each column. Each arbitration unit has
two incoming request lines from the two arbitration units below it and one incoming grant line from the
arbitration unit above it. When the arbitration unit receives a request on either or both of its incoming
request lines, it will produce a request on its outgoing request line. When the arbitration unit receives a
grant, either one or both of the arbitration units below it must have a pending request. If there is only
one pending request, then the unit returns a grant for that request. Otherwise, the unit arbitrarily chooses
one of the requests and returns a grant for that request. A grant is removed only after its associated
request has been removed. The N-input arbitration tree associated with a given column has an incoming
request line from and an outgoing grant line to each of the PC's in that column. The arbitration tree
ensures that only one PC in a column can receive a grant at a given time. Unfortunately, in the VLSI
model, $\Omega(\log N)$ time is required for the request from a PC to receive a grant from the arbitration tree.
Since we would like each input row of the network to be capable of transmitting packets at a high rate,
the delay of the arbitration tree would seem to be a problem. However, in a real implementation, the
speed of an arbitration unit may be much greater than the speed of an output column driver, and the total
delay of an arbitration tree may be comparable to the delay of an output column driver. It should be
noted that if the N PC's of a column are in a straight line, the arbitration tree for the column may require
area proportional to $N \log N$. The best layout that we know for a crossbar network with arbitration trees
requires $\Omega(N^2 w (\log N) + N^2 w^2)$ area in the VLSI model.

Fig. 26. Arbitration Tree
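
The behaviour of a column's arbitration tree can be sketched in software. This is a minimal model and not the circuit itself: the function name `arbitrate`, the list-of-booleans interface, the power-of-two restriction, and the use of a random choice to stand in for the "arbitrary" decision are all assumptions made for illustration. Requests propagate upward through log2 N levels of two-input units, and a single grant is then steered back down to exactly one requesting PC.

```python
import random


def arbitrate(requests, rng=random):
    """Return the index of the single PC that receives the grant, or None.

    requests: list of booleans, one per PC in the column. The length is
    assumed to be a power of two so that the tree is a complete binary tree.
    """
    n = len(requests)
    assert n >= 1 and n & (n - 1) == 0, "number of inputs must be a power of two"

    # Upward pass: levels[0] holds the leaf requests, levels[-1] the root request.
    levels = [list(requests)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([below[i] or below[i + 1] for i in range(0, len(below), 2)])

    if not levels[-1][0]:
        return None  # no pending request, so no grant is issued

    # Downward pass: each unit passes the grant to a child with a pending request.
    index = 0
    for below in reversed(levels[:-1]):
        left, right = below[2 * index], below[2 * index + 1]
        if left and right:
            index = 2 * index + rng.choice((0, 1))  # arbitrary choice between the two
        elif left:
            index = 2 * index
        else:
            index = 2 * index + 1
    return index
```

For example, `arbitrate([False, True, False, True])` returns either 1 or 3, never both, which is the mutual exclusion property the column requires. The number of levels visited in each pass is log2 N, matching the delay discussed above.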

Other techniques exist for maintaining mutual exclusion on the output columns. For example, each
column wire can be viewed as a broadcast medium, and the mechanisms that have been developed for
conflict resolution in broadcast networks can be applied. Mechanisms of this sort have been studied
extensively in the literature [18]. Unfortunately, these mechanisms seem to require complex strategies for
determining when a message should be retransmitted after a collision. Thus, these mechanisms seem to
require a rather complex input PC for each input row.

The performance of a crossbar network depends, among other things, on the technique used for
obtaining mutual exclusion on the output links. For uniform patterns of communication the crossbar
network can support a throughput of

$$\Omega\!\left(\frac{N}{\frac{p}{w}(\log w) + (\text{arbitration time})}\right)$$

packets per unit time where w is the number of wires in each link and p is the number of bits in each
packet. The denominator corresponds to the time required to handle a single packet, which is the time
required to gain control of the necessary output column plus the time required to transfer the packet. We
assume that log w time is required to transfer w bits through a PC where one bit comes from each of the
w wires of a link.

Indirect n-Cube Structure 

The indirect n-cube (InC) structure, which we described in the last chapter, can also be used to
construct a single chip routing network capable of achieving high throughput in applications with
uniform communication. This structure has a number of characteristics that make it interesting for VLSI.
The InC network requires only two connections to each wire and thus avoids most of the implementation
problems associated with the crossbar network. As we shall see below, the area required in the VLSI
model to construct an N-input N-output InC routing network is $O((Nw)^2)$ where w is the number of wires
in each data path.

One possible approach for laying out an N-input N-output InC network with w width data paths in
$O((Nw)^2)$ area is shown in Figure 27. It should be noted that the figure shows the case for w equal to
one. Layouts for w not equal to one can be obtained by replacing each wire in the figure with a group of
w wires. An N-input network is constructed from two (N/2)-input networks and N/2 two-input routers.
We assume that at least two layers exist and thus crossovers are possible. The first output of the first
router is connected to the first input of the first component (N/2)-input network. The last output of the
last router is connected to the last input of the second component (N/2)-input network. Other
connections between router outputs and inputs to the component routing networks are accomplished
using (N-2)w vertical wires. In particular, the ith set of w vertical wires connects the second output of
router (i+1)/2 to input (i+1)/2 of the second component network if i is odd and connects the
first output of router (i+2)/2 to input (i+2)/2 of the first component network if i is even.

Fig. 27. Layout for an InC Network

Thus, the area required to implement an N-input network, A(N), is less than $2A(N/2) + c_1 (Nw)^2$ for
some constant $c_1$. This recurrence implies that A(N) is less than

$$2 c_1 (Nw)^2 + c_2 N \tag{111}$$

for some constant $c_2$. This can be verified by substituting $2 c_1 (Nw)^2 + c_2 N$ for A(N) into the inequality

$$A(N) < 2A(N/2) + c_1 (Nw)^2. \tag{112}$$

We get

$$2 c_1 (Nw)^2 + c_2 N \stackrel{?}{=} 4 c_1 (Nw/2)^2 + 2 c_2 (N/2) + c_1 (Nw)^2 \tag{113}$$

or

$$2 c_1 (Nw)^2 + c_2 N = 2 c_1 (Nw)^2 + c_2 N. \tag{114}$$

While this layout is certainly not the smallest possible, it does demonstrate that the InC network can be
implemented in $O((Nw)^2)$ area.
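
The bound can also be checked numerically. The sketch below evaluates the layout recurrence with equality and compares it with the closed form $2 c_1 (Nw)^2 + c_2 N$. It is only an illustration: the base case (a one-input network of area $c_2$), the default constants, and the function names are assumptions that do not appear in the text.

```python
def layout_area(n, w, c1=1, c2=1):
    """Area given by the recurrence taken with equality: A(N) = 2 A(N/2) + c1 (N w)^2.

    Assumes n is a power of two and a one-input base case of area c2
    (the base case is an assumption; the text does not state one).
    """
    if n == 1:
        return c2
    return 2 * layout_area(n // 2, w, c1, c2) + c1 * (n * w) ** 2


def area_bound(n, w, c1=1, c2=1):
    """Closed-form bound: 2 c1 (N w)^2 + c2 N."""
    return 2 * c1 * (n * w) ** 2 + c2 * n


# The recurrence stays below the bound for every size and width tried here.
for n in (2, 8, 64, 1024):
    for w in (1, 4, 16):
        assert layout_area(n, w) < area_bound(n, w)
```

Unrolling the recurrence by hand under the same assumptions gives $A(N) = 2 c_1 (Nw)^2 - 2 c_1 w^2 N + c_2 N$, so the quadratic term dominates and the linear term of the bound only absorbs the base cases.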



It should be noted that there is an imbalance between the wires and the PC's of the InC network.
In the VLSI model, the InC network requires only $(N/2)\log_2 N$ PC's but it requires $\Theta((Nw)^2)$ area for
its wires. In addition, the PC's of the InC network are complex. Thus, it seems that most layouts for the
InC network will not be homogeneous but will instead contain separate areas for wires and PC's.

The throughput of indirect n-cube networks for uniform patterns of communication was discussed
in the previous chapter. The strongest constraint that we studied in the previous chapter still allows a
throughput of $\Omega(N)$ with the model of time of the previous chapter where N is the number of inputs. In
the VLSI model, this would suggest a throughput of $\Omega\!\left(\frac{N w}{p (\log w)}\right)$ packets per unit time where p is the
number of bits in a packet and w is the number of wires in a link. The $\frac{1}{\log w}$ factor comes from the fact
that we assume (log w) time is required for a PC to process w bits where one bit comes from each of the
w wires of a link.



While N-input InC networks with w-wire data paths and N-input crossbar networks with w-wire
data paths both require $\Theta((Nw)^2)$ area in the VLSI model, there are some important differences between
the two networks. The N-input InC network has only $(N/2)\log_2 N$ PC's, but the crossbar has $N^2 + 2N$
PC's. The nodes of the InC network are more complex than the nodes of the crossbar network. Each
node of the InC network requires a buffer on each of its inputs and requires a control circuit that is more
complex than the control circuit of a node of the crossbar network. As a result, more area is required in
the VLSI model to implement a node for the InC network than to implement a node for the crossbar
network.

Other Network Structures

There may well exist structures that are better suited for single chip high performance uniform
communication networks than the InC structure and the crossbar structure. We have discussed the areas
required in the VLSI model by the crossbar network and the InC network. There are problems with both
the VLSI implementation of the crossbar network and the VLSI implementation of the InC network. As
we have discussed earlier, each of the column wires of the crossbar network must be driven by a large
number of PC's. The InC network has an imbalance between its wires and its PC's.

We have examined two other network structures. For N-input N-output networks, these structures
require $O(N^2)$ PC's and have two PC's connected to each wire. The restriction on the number of
connections to a wire ensures that these structures do not have the problems associated with
implementation of multiple drivers for a single wire. Further, these structures require roughly the same
total area for PC's as they require for wires, and these structures have simple regular layouts.

One network structure that we have examined is the forest network. The N-input forest network is
composed of 2N large trees. N of these trees are N-output switch trees, and the other N trees are N-input
merge trees. An N-output switch tree is constructed from two (N/2)-output switch trees and a switch as
shown in Figure 28. A switch is a device with one input and two outputs, and has the capacity to buffer
some number of packets. The switch routes a packet according to the packet's destination tag. A
two-output switch tree is simply a switch. Similarly, an N-input merge tree is constructed from two
(N/2)-input merge trees and a merge as shown in Figure 29. A merge has two inputs and one output and
some amount of internal buffering. The merge funnels all packets on its inputs to its output. Packets are
output in the order in which they are accepted, and inputs are examined in a round robin fashion. A
two-input merge tree is simply a merge.

Fig. 28. Switch Tree

Fig. 29. Merge Tree

Each input of an N-input N-output forest network is connected to the input of a separate N-output
switch tree, and each output of the forest network is connected to the output of an N-input merge tree. For
each i and j such that $1 \le i \le N$ and $1 \le j \le N$, the jth output of the switch tree associated with the
ith network input is connected to the ith input of the merge tree associated with the jth network output.
Thus, each switch tree sorts the packets from a given input according to their destination addresses, and
each merge tree collects all packets destined for a given output.
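
The structure of the forest network described above can be sketched as follows. This is a minimal illustration under stated assumptions: N is a power of two, inputs and outputs are numbered from 0 rather than 1, and each switch is assumed to examine one bit of the destination tag, most significant bit first (the text does not specify the bit order). All function names are hypothetical.

```python
def switch_tree_route(dest, n):
    """Directions (0 = first output, 1 = second) taken through an n-output switch tree.

    Assumes n is a power of two, outputs are numbered 0..n-1, and each switch
    examines one bit of the destination tag, most significant bit first.
    """
    depth = n.bit_length() - 1  # log2(n) switches on every root-to-leaf path
    return [(dest >> (depth - 1 - level)) & 1 for level in range(depth)]


def forest_connections(n):
    """Wiring between trees: (input i, switch-tree output j) -> (output j, merge-tree input i)."""
    return {(i, j): (j, i) for i in range(n) for j in range(n)}


def pcs_on_path(n):
    """Switches plus merges traversed by any packet: log2(n) of each."""
    return 2 * (n.bit_length() - 1)


def forest_pc_count(n):
    """Total PC's: n switch trees and n merge trees, each containing n - 1 PC's."""
    return 2 * n * (n - 1)
```

For an 8-input network, a packet for output 5 takes the directions `[1, 0, 1]` through its switch tree and then passes three merges, and the network as a whole contains `forest_pc_count(8) == 112` PC's, which grows as $N^2$.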

Unfortunately, we have not found a good layout for the forest network. We have found a simple
layout, Figures 30-32, that requires $\Theta((Nw \log N)^2)$ area in the VLSI model. We have not shown that
this layout is the smallest possible. It should be noted that the figures show the case for w equal to one.
Layouts for w not equal to one can be obtained by replacing each wire in the figures with a group of w
wires.

Fig. 30. N Input N Output Forest Network

Fig. 31. 0,n Forest Subsection

We have studied another network that is related to the crossbar network. We call this network the
checkerboard network. The checkerboard network, shown in Figure 33, has a grid structure much like
that of the crossbar network with a row corresponding to each input and a column corresponding to each
output. But unlike the crossbar network that uses one set of w wires to connect the PC's in a given row,
the N-input checkerboard network uses N sets of w wires. (N-1) sets of w wires connect adjacent grid
PC's in the row, and one set of w wires connects the input PC to the nearest grid PC in the row.
Similarly, the N-input checkerboard network uses N sets of w wires to connect the PC's in a given
column. (N-1) sets connect adjacent grid PC's in the column, and one set connects the output PC to the
nearest grid PC in the column. Thus, each wire in the checkerboard network is connected to only two
PC's.

The grid PC's of the checkerboard network, unlike the grid PC's of the crossbar network, have the 
capacity to buffer some number of packets. A packet is transferred from an input to an output by passing 
it along a path of connected PC's between the two. 

The N-input checkerboard network, like the N-input crossbar network, requires $N^2 + 2N$ PC's,
and can be laid out using a total area which is proportional to $(Nw)^2$.

Fig. 32. L, n Forest Subsection (i < n)

Fig. 33. Checkerboard Network

Although the throughput of the checkerboard network in uniform communication applications may
be very good, the time required for each packet to be transmitted through the network is very long. In
particular, if the inputs of an N-input checkerboard network are connected to model uniform
communication sources, then the expected number of PC's in the path through the network taken by a
randomly arriving packet must be $\Omega(N)$. This implies that the average delay for a packet is $\Omega(N \log w)$
units of time. This long average delay time is the primary weakness of the checkerboard network.
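
The linear growth of the path length can be illustrated with a small calculation. The sketch below assumes a particular geometry and route that are not specified in the text: rows are numbered 1 to N from the top, columns 1 to N from the left, the input PC of each row sits at its left end, the output PC of each column sits at its bottom end, and a packet travels along its row to the destination column and then down that column. The function names are hypothetical.

```python
from fractions import Fraction


def grid_pcs_on_path(i, j, n):
    """Grid PC's visited by a packet from input row i to output column j.

    The packet passes j grid PC's along row i, then the n - i grid PC's
    below row i in column j (row-then-column route, an assumption).
    """
    return j + (n - i)


def expected_grid_pcs(n):
    """Average number of grid PC's on a path, with i and j uniform over 1..n."""
    total = sum(
        grid_pcs_on_path(i, j, n)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
    )
    return Fraction(total, n * n)
```

With this route the average is exactly N grid PC's (for example, `expected_grid_pcs(8) == 8`), so with log w time spent in each PC the average delay grows in proportion to N log w.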



In the technology of the next five to ten years, the checkerboard network seems less interesting than
the crossbar network. In this technology, it should be possible to build single chip crossbar networks and
single chip checkerboard networks of approximately the same size. The two networks are capable of
similar throughput for uniform communication applications, but the expected delay through the
checkerboard network is much greater than the expected delay of the crossbar network. However, it is
difficult to predict the changes that will occur in technology in the more distant future. Networks such as
the checkerboard network may become more important as chip complexity increases and the feasibility of
the multiply driven wires of the crossbar network decreases.

3.3.3 Multiple Chip Networks 

If the network required by a particular system cannot be implemented on a single chip then,
obviously, the network must be implemented as an interconnection of several chips. One technique for
constructing a large composite network involves the interconnection of several single chip networks. For
the purpose of discussion, we refer to the component single chip networks of such a composite network as
the subnetworks of that network. The subnetworks of a composite network are mounted on circuit
boards and are interconnected by board wires. Thus, the issues involved in the interconnection of the
subnetworks of a composite network are similar to the issues involved in the interconnection of the nodes
of a network of the previous chapter. In particular, the length of the wires used to interconnect the
subnetworks has less effect on the overall cost of the composite network than the number of such wires
and the number of subnetworks. Thus, interconnection patterns similar to those used in the previous
chapter may be appropriate for interconnecting the subnetworks of a composite network. In particular,
an interconnection similar to the indirect n-cube structure seems interesting. This interconnection has the
form shown in Figure 34. If a-input a-output subnetworks are used then the composite network
contains $\log_a N$ stages of subnetworks. For the purpose of discussion, we number the stages from the
network inputs to the network outputs with the stage connected to the network inputs being the zeroth
stage. The ith stage is divided into $a^i$ groups of subnetworks. Each group of the ith stage has a
associated groups in the (i+1)st stage. The jth output of each subnetwork of a group of the ith stage is
connected to the jth associated group of the (i+1)st stage.
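
The stage and group structure just described can be tabulated with a short sketch. It assumes that N is an exact power of a and that every stage contains N/a subnetworks (one for each block of a data paths); the second assumption is an inference from the description rather than a statement in the text, and the function name is hypothetical.

```python
def composite_shape(n, a):
    """Stage-by-stage shape of an n-input composite network built from
    a-input a-output subnetworks. Assumes n is an exact power of a, a >= 2."""
    stages = 0
    size = 1
    while size < n:
        size *= a
        stages += 1
    assert size == n, "n must be a power of a"

    per_stage = n // a  # assumed number of subnetworks in every stage
    return [
        {"stage": i, "groups": a ** i, "subnetworks_per_group": per_stage // a ** i}
        for i in range(stages)
    ]
```

For a 64-input network built from 4-input subnetworks, this gives three stages with 1, 4, and 16 groups containing 16, 4, and 1 subnetworks each, so the last stage has one subnetwork per group.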

If a composite network is constructed from single chip indirect n-cube networks in the manner
described in the previous paragraph then the composite network is an indirect n-cube network. If a
composite network is constructed from single chip crossbar networks in the manner described in the
previous paragraph then we expect the average throughput of the composite network to be at least as
large as that of an indirect n-cube network of the same size.

Fig. 34. N x N Composite Network



3.4 Networks for Systems with Localized Communication 

Many localized communication patterns can be supported by networks in VLSI that are
substantially cheaper than a uniform communication network. However, the design of networks in VLSI
for localized patterns of communication must take careful account of the cost of wires. In the
technologies of the previous chapter, we determine the cost of a network by considering the number of
wires, and the number and size of nodes of the network. In VLSI, the length of each wire must also be
considered. If each source module of a system generates packets for only a constant number of
destinations then the communication requirements of the system can be supported in the technologies of
the previous chapter by a linear cost network. However, many such systems require networks with
greater than linear cost in VLSI. For example, the "perfect shuffle" pattern [24] can be implemented
with a number of wires proportional to N but it requires $\Theta((N/\log N)^2)$ wire area in the VLSI model
[13].

In this section, we describe two obvious but important networks that can be implemented in VLSI
in area proportional to their number of inputs. While we do not examine these networks in great detail,
we do discuss some of the issues involved in their VLSI implementation.

The first of these networks is the grid network. We consider two dimensional grids as shown in
Figure 35. Each grid PC is connected to the grid PC's adjacent to it. If each grid PC is connected to a
network input and a network output then the number of PC's is obviously linear in the number of
network inputs and outputs. The area required to implement the wires that interconnect a given PC to
PC's adjacent to it is proportional to the area of the PC. Thus, the total area required for the wires of a
grid network is linear in the number of network inputs.

Fig. 35. Grid Network

In VLSI, one of the biggest issues in the implementation of linear cost networks in general and grid
structured networks in particular may be the constraint placed by the limited number of input and output
connections to a chip. Presently, chips with 64 connections are commonly available. In the next five to
ten years, packaging technologies capable of handling chips with 100 to 200 connections should be
common. However, it will probably be possible to implement a grid network with several thousand PC's
on a single chip. This suggests that if a system of modules interconnected by a grid structured network is
to be implemented in VLSI, it may be better to place some of the modules and a portion of the network
on each chip than to place the modules and the network on separate chips. If some number of the
modules and the portion of the grid network required to interconnect them are placed on a chip then the
input and output requirements of the chip may be modest. The only signal wires that need to go off of
the chip are those wires that connect the grid of the chip to the grids of other chips. These wires connect
to the PC's along the perimeter of the grid of the chip.

The second network that we consider is the tree network. In such a network, the PC's of the
network are connected in a tree as shown in Figure 36. The inputs and outputs of the network can be
connected in at least two possible ways. One way is to connect each PC of the network to a network input
and a network output. Another is to connect only the leaf PC's to the network inputs and outputs.

Fig. 36. Tree Network

Fig. 37. Layout for a Tree with N Leaves

A tree network can be implemented in a small area in VLSI. A tree network can be laid out in
$O(N)$ area as shown in Figure 37. But there are some problems caused by the limited number of
connections that can be made to a chip. As was the case for the grid network, it may be possible in VLSI
to implement on a single chip a tree network with a large number of PC's but such an implementation
could not have an off chip connection for each PC. This suggests that each chip in the VLSI
implementation of a large tree structured system should contain both a portion of the network and the
modules to be connected to that portion of the network. Unlike the case for a grid network where the
perimeter PC's represent only a small fraction of the PC's of the network, the leaf PC's of a tree network
represent over half of the PC's of that network. Thus, while it may be possible on a single chip to
implement a tree structured subsystem with a large number of very small modules, it is probably not
possible to make an off chip connection for each leaf module of such a subsystem. As a result, it may be
difficult to decompose a large tree structured system of very small modules into subsystems such that
each subsystem requires the area of one chip and such that no subsystem requires more connections than
the number of connections that can be made to a single chip. However, in the technology of the next five
to ten years this may be a problem only for systems with very small modules. Unless the modules of a
system are very small, it is likely that only a few dozen of the modules fit on a single chip and it should be
possible to provide an off chip connection for each module of each chip.
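
The contrast between the two networks drawn in this section can be sketched numerically. The functions below compare the fraction of PC's on the perimeter of a square grid with the fraction of PC's that are leaves of a tree. They are illustrative only: a k-by-k grid and a complete binary tree are assumed (the text does not fix either shape), and the function names are hypothetical.

```python
from fractions import Fraction


def grid_perimeter_fraction(k):
    """Fraction of PC's that lie on the perimeter of a k-by-k grid (k >= 2)."""
    return Fraction(4 * k - 4, k * k)


def tree_leaf_fraction(leaves):
    """Fraction of PC's that are leaves in a complete binary tree.

    A complete binary tree with the given number of leaves has
    2 * leaves - 1 PC's in total.
    """
    return Fraction(leaves, 2 * leaves - 1)
```

For a 64-by-64 grid of 4096 PC's only 252 are on the perimeter (a fraction of 63/1024), whereas a complete binary tree with 2048 leaves has 4095 PC's of which just over half are leaves. The first fraction shrinks as the grid grows while the second never falls below one half.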

4. Conclusion

4.1 Summary 

In this thesis we examined the design of routing networks under two different sets of assumptions
about the implementation technology. One set corresponds to present (1984) technology where only a
small number of nodes can be implemented on a single integrated circuit. The other set corresponds to a
VLSI technology where a large number of network nodes can be implemented on a single integrated
circuit.

In present technology, we examined the design of routing networks for systems with uniform 
communication and we briefly examined the design of routing networks for systems with a few particular 
patterns of localized communication. 

We showed that $\Omega(N \log N)$ nodes are required by any N-input N-output network capable of
supporting an average throughput of $\Omega(N)$ packets per unit time for our model of uniform
communication.

We studied in detail one particular routing network, the indirect n-cube routing network, which
seems well suited for uniform communication and requires $O(N \log N)$ nodes. We examined certain
important characteristics of the operation of very large indirect n-cube networks and the effect of these
characteristics on network performance.

We examined the buffering of packets in front of a slow router. Such buffering involves a tree of
routers in front of the slow router. Our model suggests that the expected number of packets buffered in front

( ^M 2)

of the slow router is greater than 2 2(^i/7 -IN)(H + 1) . \ where IN is the rate at which packets are
generated on each network input, B is the size of the buffer on each input of each network node, and
OUT is the rate at which the slow router can accept packets.

We examined the effect that congestion in routers of a given stage of an indirect n-cube network has
on the network's performance. We chose to study the effect of congestion in routers of the last stage since
analysis of the last stage is somewhat easier than analysis of other stages. We examined the buffering
caused by such congestion and the effect of such buffering on network throughput. Our study suggests
that congestion in a single stage of routers does not place a severe constraint on the throughput of the
network. Our study suggests that this type of congestion still allows the normalized throughput of the
network (the total throughput of the network divided by the number of network inputs) to approach a
nonzero constant as the size of the network goes to infinity, and that even for modest buffer size this
constant is not significantly less than the normalized throughput of a two-input two-output network.

We examined the effect of the interaction of routers of different stages on network performance.
We studied the interaction of routers along a network path and we studied the interaction of routers in a
tree of the network.

We studied the interaction of routers along a network path primarily by simulating a model of a
network path. Our model reflects the interaction of routers along a randomly selected network path while
ignoring the interaction between a router on the path and any router not on the path. Our study suggests
a limit on the input rate of a randomly selected network path and thus implies a limit on the overall
throughput of the network. This limit on network throughput is stronger than any of the other limits
studied. However, this limit still allows normalized throughput to approach a nonzero constant as
network size approaches infinity.

We examined the interaction of routers in a tree of the network. We examined one particular type
of interaction that occurs in trees of modest size, trees of less than 10 stages. Our study indicates that this
type of interaction does not have an important effect on the overall throughput of the network but it does
cause a few of the routers connected to the network inputs to be slow for a long period of time. For
example, our study suggests that this type of interaction can cause the input rate for the slowest input of a
network with eight or nine stages to be less than half the expected input rate for a randomly selected
input for a period of forty units of time.

We also briefly considered a factor that has an influence on the speed of the slow inputs of very 
large networks. This factor causes the slowest input router to require S2( . ) time to accept 3/J packets 
where B is the buffer size and n is the depth of the network. 

In summary, our work indicates that for present technology the indirect n-cube network is a good
network for handling uniform communication. The strongest constraint on throughput that we studied
still allows throughput to grow linearly with network size. However, our study also indicates that even in
indirect n-cube networks of modest size some of the network inputs can be slow for a long period of time.

We briefly examined one obvious family of networks that are appropriate in present technology for 
some important localized communication patterns. This family includes grid structured networks and 
tree structured networks. 

We also briefly examined the design of routing networks in VLSI. We described a model of VLSI 
based on assumptions about the characteristics of VLSI. The model reflects the fact that in VLSI the cost 
of a wire is proportional to its length. 

We examined the design of uniform communication networks in VLSI. We showed that $\Omega((rN)^2)$
area is required in the VLSI model to implement any single chip N-input N-output routing network
capable of supporting an average throughput of rN packets per unit time for our model of uniform
communication. We examined a few structures that are appropriate for implementing a single chip
uniform communication network. These included a crossbar structure and an indirect n-cube structure.
The crossbar network is probably the most attractive since it has a simple regular layout. However, the
crossbar network requires long buses with a large number of drivers that must be arbitrated. The indirect
n-cube network does not require multiply driven buses, but it does require more complex network nodes
and may require a more complex layout. We discussed a technique for interconnecting such single chip
networks to form larger uniform communication networks.

We briefly examined the design in VLSI of networks for localized patterns of communication. We
discussed the fact that some networks that can be implemented in present technology with a number of
nodes proportional to their number of inputs require greater than linear wire cost in the VLSI model. We
discussed two obvious but important networks that can be implemented in VLSI in area proportional to
their number of inputs, the grid network and the tree network. We discussed the pin out problems of
both networks for very dense VLSI. We concluded that in very dense VLSI the processing modules of a
system using either network should be placed on the same chips as the modules of the network.

4.2 Suggestions for Further Work 

There are many areas where further work could be done. Some of these are discussed below.

There are interesting open questions concerning the performance of large indirect n-cube networks
for uniform communication. The strongest constraint that we studied still allows the throughput of an
indirect n-cube network for our model of uniform communication to grow linearly with the size of the
network. However, we did not prove such a linear growth. Such a proof appears to be difficult. It may
be possible to obtain a proof if additional constraints are placed on the operation of the network. As was
discussed in 2.2.3.1, an approach similar to Pippenger's may be effective.

Clearly, more work can be done on the design of routing networks in present technology for
localized patterns of communication. There are of course a wide variety of localized patterns of
communication and it is probably not useful to try to examine all possible patterns. However, there may
be interesting families of communication patterns that can be efficiently supported by families of
networks. For example, there may be an interesting family of communication patterns that can be
supported by the family of networks described below. Each network of the family corresponds to a tree
network of the same size. As is shown in Figure 38, the network has c links for each link of the tree
network and a 3c x 3c subnetwork for each node of the tree network where c is a constant. The
subnetwork is some type of 3c-input 3c-output routing network. Different members of the family have
different values of c. A network of this family may be able to handle c times as much traffic between
distant nodes as the corresponding tree network. Other families of networks related to the tree network
may be able to support interesting families of communication patterns. For example, Leiserson [17] has
studied a more sophisticated family of networks called fat trees that seems to be able to support a wide
class of communication patterns.

Similarly, more work can be done on the design of single chip routing networks in VLSI. It would
seem that low level implementation issues will continue for some time to be important in the design of
single chip routing networks. Thus, a more detailed look at the implementation of crossbar networks,
indirect n-cube networks, and other potential networks is needed. In order to fairly evaluate data path
sizes for the crossbar network and the indirect n-cube network, and arbitration schemes for the crossbar
network, it may be useful to examine tentative chip layouts.

One important issue that has not been considered in this thesis is the issue of real time fault
detection and fault masking in routing networks. Some related work has been done elsewhere [2, 19], but
more is needed. Detection of some faults can be accomplished by schemes that use check fields in each
packet. Some types of fault masking can be accomplished if multiple paths exist between each source and
each destination. Such paths can easily be introduced in a network such as the indirect n-cube network
by adding one or more additional stages.